Ultrafast classical phylogenetic method beats large protein language models on variant effect prediction.

Authors

Prillo Sebastian, Wu Wilson, Song Yun S

Affiliation

University of California, Berkeley.

Publication

Adv Neural Inf Process Syst. 2024;37:130265-130290.

PMID:40487750

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12143485/

Abstract

Amino acid substitution rate matrices are fundamental to statistical phylogenetics and evolutionary biology. Estimating them typically requires reconstructed trees for massive amounts of aligned proteins, which poses a major computational bottleneck. In this paper, we develop a near-linear time method to estimate these rate matrices from multiple sequence alignments (MSAs) alone, thereby speeding up computation by orders of magnitude. Our method relies on a near-linear time cherry reconstruction algorithm, which we call FastCherries, and it can be easily applied to MSAs with millions of sequences. On both simulated and real data, we demonstrate the speed and accuracy of our method as applied to the classical model of protein evolution. By leveraging the unprecedented scalability of our method, we develop a new, rich phylogenetic model called SiteRM, which can estimate a general rate matrix for each column of an MSA. Remarkably, in variant effect prediction for both clinical and deep mutational scanning data in ProteinGym, we show that despite being an independent-sites model, our SiteRM model outperforms large protein language models that learn complex residue-residue interactions between different sites. We attribute our increased performance to conceptual advances in our probabilistic treatment of evolutionary data and our ability to handle extremely large MSAs. We anticipate that our work will have a lasting impact across both statistical phylogenetics and computational variant effect prediction. FastCherries and SiteRM are implemented in the CherryML package: https://github.com/songlab-cal/CherryML.
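To make the idea of a per-column rate matrix more concrete, here is a minimal sketch; it is not the CherryML implementation. It assumes a continuous-time Markov model in which a rate matrix Q gives transition probabilities P(t) = exp(tQ), and it scores a substitution by the log probability of moving from the wild-type residue to the mutant residue. The toy 3-letter alphabet, the two example matrices, the branch length `t`, and the function names `transition_probs` and `variant_score` are all illustrative assumptions. The abstract does not state how SiteRM actually scores variants.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_probs(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(t * Q) by scaling and squaring a Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        # term now holds (scale * Q)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    # Undo the scaling: exp(tQ) = exp(tQ / 2^s)^(2^s)
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def variant_score(q, wt, mut, t=1.0):
    """Hypothetical score: log P(t)[wt][mut] under one column's rate matrix."""
    return math.log(transition_probs(q, t)[wt][mut])

# Toy per-column rate matrices over a 3-letter alphabet (rows sum to zero).
Q_CONSERVED = [[-0.02, 0.01, 0.01],
               [0.01, -0.02, 0.01],
               [0.01, 0.01, -0.02]]
Q_VARIABLE = [[-1.0, 0.5, 0.5],
              [0.5, -1.0, 0.5],
              [0.5, 0.5, -1.0]]

if __name__ == "__main__":
    print("conserved column:", variant_score(Q_CONSERVED, wt=0, mut=1))
    print("variable column: ", variant_score(Q_VARIABLE, wt=0, mut=1))
```

With these toy matrices, the same substitution receives a lower score at the slowly evolving column than at the fast one. That is the intuition for why column-specific rates could help flag damaging variants, even in a model that treats sites independently.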

Similar articles

Embeddings from protein language models predict conservation and variant effects.

Hum Genet. 2022 Oct;141(10):1629-1647. doi: 10.1007/s00439-021-02411-y. Epub 2021 Dec 30.

BetaAlign: a deep learning approach for multiple sequence alignment.

Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btaf009.

PFASUM: a substitution matrix from Pfam structural alignments.

BMC Bioinformatics. 2017 Jun 5;18(1):293. doi: 10.1186/s12859-017-1703-z.

Improving Protein Secondary Structure Prediction by Deep Language Models and Transformer Networks.

Methods Mol Biol. 2025;2867:43-53. doi: 10.1007/978-1-0716-4196-5_3.

Protein language models trained on multiple sequence alignments learn phylogenetic relationships.

Nat Commun. 2022 Oct 22;13(1):6298. doi: 10.1038/s41467-022-34032-y.

Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map.

BMC Bioinformatics. 2016 Mar 18;17:133. doi: 10.1186/s12859-016-0945-5.

FastMG: a simple, fast, and accurate maximum likelihood procedure to estimate amino acid replacement rate matrices from large data sets.

BMC Bioinformatics. 2014 Oct 24;15(1):341. doi: 10.1186/1471-2105-15-341.

Leveraging protein language models for accurate multiple sequence alignments.

Genome Res. 2023 Jul;33(7):1145-1153. doi: 10.1101/gr.277675.123. Epub 2023 Jul 6.

CherryML: scalable maximum likelihood estimation of phylogenetic models.

Nat Methods. 2023 Aug;20(8):1232-1236. doi: 10.1038/s41592-023-01917-9. Epub 2023 Jun 29.
