Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland.
SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland.
Nat Commun. 2022 Oct 22;13(1):6298. doi: 10.1038/s41467-022-34032-y.
Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
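To illustrate the kind of comparison the abstract describes, the sketch below computes pairwise normalized Hamming distances between sequences of an MSA and correlates them with an averaged, symmetrized column-attention map. This is a minimal sketch, not the authors' released code: the function argument `col_attn` is a hypothetical placeholder for whatever procedure averages MSA Transformer's column attentions (e.g. via the fair-esm package) into an M x M matrix over the M sequences of the MSA.

```python
# Minimal sketch (assumed workflow, not the paper's code): correlate an
# averaged column-attention matrix with pairwise Hamming distances in an MSA.
import numpy as np
from scipy.stats import spearmanr


def hamming_distance_matrix(msa):
    """Pairwise normalized Hamming distances between aligned sequences."""
    seqs = np.array([list(s) for s in msa])  # shape (M, L)
    m = len(msa)
    d = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            d[i, j] = d[j, i] = np.mean(seqs[i] != seqs[j])
    return d


def attention_vs_hamming(col_attn, msa):
    """Spearman correlation between attention entries and Hamming distances,
    taken over the strict upper triangle (i < j) of both M x M matrices."""
    d = hamming_distance_matrix(msa)
    iu = np.triu_indices(len(msa), k=1)
    rho, _ = spearmanr(col_attn[iu], d[iu])
    return rho


# Toy usage with a random stand-in for the averaged column-attention matrix:
msa = ["MKTAYIAKQR", "MKTAYIARQR", "MKSAYIAKQR", "MQTAYLAKQR"]
fake_attn = np.random.rand(4, 4)
fake_attn = (fake_attn + fake_attn.T) / 2  # symmetrize the stand-in matrix
print(attention_vs_hamming(fake_attn, msa))
```

A strong (anti)correlation between the two upper triangles would indicate that the column attentions encode the phylogenetic structure captured by Hamming distances, which is the relationship the abstract reports.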