An analysis of protein language model embeddings for fold prediction.

Affiliations

Department of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain.

Publication information

Brief Bioinform. 2022 May 13;23(3). doi: 10.1093/bib/bbac142.

Abstract

The identification of the protein fold class is a challenging problem in structural biology. Recent computational methods for fold prediction leverage deep learning techniques to extract protein fold-representative embeddings, mainly using evolutionary information in the form of multiple sequence alignments (MSA) as input. In contrast, protein language models (LM) have reshaped the field thanks to their ability to learn efficient protein representations (protein-LM embeddings) from purely sequential information in a self-supervised manner. In this paper, we analyze a framework for protein fold prediction using pre-trained protein-LM embeddings as input to several fine-tuning neural network models, which are trained in a supervised manner with fold labels. In particular, we compare the performance of six protein-LM embeddings: the long short-term memory-based UniRep and SeqVec, and the transformer-based ESM-1b, ESM-MSA, ProtBERT and ProtT5; as well as three neural networks: Multi-Layer Perceptron, ResCNN-BGRU (RBG) and Light-Attention (LAT). We separately evaluated the pairwise fold recognition (PFR) and direct fold classification (DFC) tasks on well-known benchmark datasets. The results indicate that the combination of transformer-based embeddings, particularly those obtained at the amino acid level, with the RBG and LAT fine-tuning models performs remarkably well in both tasks. To further increase prediction accuracy, we propose several ensemble strategies for PFR and DFC, which provide a significant performance boost over the current state-of-the-art results. All this suggests that moving from traditional protein representations to protein-LM embeddings is a very promising approach to protein fold-related tasks.
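The pipeline the abstract describes (per-residue embeddings from a pre-trained protein LM, pooled to a fixed-size vector and fed to a supervised fine-tuning network such as the Multi-Layer Perceptron) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `embed_sequence` function is a random stand-in for a real protein LM (a real ESM-1b model would be queried instead), and the hidden size and fold count are illustrative assumptions, not values reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_sequence(length: int, dim: int = 1280) -> np.ndarray:
    """Stand-in for per-residue protein-LM embeddings (e.g. ESM-1b emits a
    1280-dim vector per amino acid); random values replace a real model here."""
    return rng.normal(size=(length, dim))

def mean_pool(per_residue: np.ndarray) -> np.ndarray:
    # Collapse variable-length (L, dim) per-residue embeddings into one
    # fixed-size protein-level vector, so sequences of any length fit the MLP.
    return per_residue.mean(axis=0)

class MLPFoldClassifier:
    """Minimal single-hidden-layer perceptron over a pooled embedding,
    producing a probability distribution over fold classes."""

    def __init__(self, dim: int, hidden: int, n_folds: int):
        self.w1 = rng.normal(scale=0.02, size=(dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.02, size=(hidden, n_folds))
        self.b2 = np.zeros(n_folds)

    def forward(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(x @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        logits = h @ self.w2 + self.b2
        e = np.exp(logits - logits.max())           # stable softmax
        return e / e.sum()

# Illustrative sizes: 512 hidden units, 1195 fold classes (both assumptions).
clf = MLPFoldClassifier(dim=1280, hidden=512, n_folds=1195)
probs = clf.forward(mean_pool(embed_sequence(length=230)))
print(probs.shape)
```

In training, the softmax output would be compared against the fold label with a cross-entropy loss; for the pairwise fold recognition task, the pooled embeddings of two proteins would instead be compared for fold similarity.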

