Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction.

Affiliation

TUM (Technical University of Munich), Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany.

Publication information

Structure. 2022 Aug 4;30(8):1169-1177.e4. doi: 10.1016/j.str.2022.05.001. Epub 2022 May 23.

DOI:10.1016/j.str.2022.05.001

PMID:35609601

Abstract

Advanced protein structure prediction requires evolutionary information, such as evolutionary couplings derived from multiple sequence alignments (MSAs), which is not always available. Artificial intelligence (AI)-based predictions inputting only single sequences are faster but so inaccurate as to render speed irrelevant. Here, we described a competitive prediction of inter-residue distances (2D structure) that exclusively inputs embeddings from a pre-trained protein language model (pLM), namely ProtT5, computed from single sequences, into a convolutional neural network (CNN) with relatively few layers. The major advance came from using the ProtT5 attention heads. Our new method, EMBER2, which never requires any MSAs, performed similarly to other methods that fully rely on co-evolution. Although clearly not reaching AlphaFold2, our leaner solution came somewhat close at substantially lower cost. By generating protein-specific rather than family-averaged predictions, EMBER2 might better capture some features of particular protein structures. Results from protein engineering and deep mutational scanning (DMS) experiments provided at least a proof of principle for such a speculation.
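
To make the prediction target concrete, the sketch below illustrates what an inter-residue distance map (the "2D structure" mentioned above) is: an L x L matrix of pairwise distances between residues. It is only an illustration of the target representation, not EMBER2 itself and not the authors' code. The function names `distance_map` and `contact_map`, the toy coordinates, and the 8 Å contact threshold (a common convention, not a value taken from the abstract) are all assumptions made for the example.

```python
import math


def distance_map(coords):
    """Return the symmetric L x L matrix of pairwise Euclidean distances.

    `coords` holds one (x, y, z) tuple per residue, e.g. one representative
    atom per residue (which atom to use is left open here).
    """
    n = len(coords)
    dmap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            dmap[i][j] = dmap[j][i] = d  # distances are symmetric
    return dmap


def contact_map(dmap, threshold=8.0):
    """Binarize a distance map; the 8.0 default is an assumed convention."""
    return [[d < threshold for d in row] for row in dmap]


# Hypothetical toy coordinates for three residues (not real protein data).
toy_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
toy_dmap = distance_map(toy_coords)
toy_contacts = contact_map(toy_dmap)
```

A method of the kind described in the abstract would predict such a matrix (or a binned version of it) directly from single-sequence pLM features, without access to the coordinates. Note that the diagonal of the contact map is trivially in contact, since each residue has distance zero to itself.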
