Ultrafast classical phylogenetic method beats large protein language models on variant effect prediction.

Authors

Prillo Sebastian, Wu Wilson, Song Yun S

Affiliation

University of California, Berkeley.

Publication

Adv Neural Inf Process Syst. 2024;37:130265-130290.

Abstract

Amino acid substitution rate matrices are fundamental to statistical phylogenetics and evolutionary biology. Estimating them typically requires reconstructed trees for massive amounts of aligned proteins, which poses a major computational bottleneck. In this paper, we develop a near-linear time method to estimate these rate matrices from multiple sequence alignments (MSAs) alone, thereby speeding up computation by orders of magnitude. Our method relies on a near-linear time cherry reconstruction algorithm, which we call FastCherries, and it can be easily applied to MSAs with millions of sequences. On both simulated and real data, we demonstrate the speed and accuracy of our method as applied to the classical model of protein evolution. By leveraging the unprecedented scalability of our method, we develop a new, rich phylogenetic model called SiteRM, which can estimate a general rate matrix for each column of an MSA. Remarkably, in variant effect prediction for both clinical and deep mutational scanning data in ProteinGym, we show that despite being an independent-sites model, our SiteRM model outperforms large protein language models that learn complex residue-residue interactions between different sites. We attribute our increased performance to conceptual advances in our probabilistic treatment of evolutionary data and our ability to handle extremely large MSAs. We anticipate that our work will have a lasting impact across both statistical phylogenetics and computational variant effect prediction. FastCherries and SiteRM are implemented in the CherryML package (https://github.com/songlab-cal/CherryML).
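To make the idea of a per-column rate matrix more concrete, the following is a minimal sketch of how an independent-sites model could turn a rate matrix into a variant score. It is not the CherryML API and not necessarily the scoring rule used in the paper: the function names (`make_rate_matrix`, `transition_matrix`, `variant_score`), the randomly generated exchangeabilities, the evolutionary time `t`, and the choice of log transition probability as the score are all assumptions made for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_rate_matrix(exchangeabilities, stationary):
    """Build a reversible rate matrix with Q[i, j] = S[i, j] * pi[j] for i != j.

    The diagonal is set so that every row sums to zero.
    """
    q = exchangeabilities * stationary[np.newaxis, :]
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))
    return q


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate expm(Q * t) by a truncated Taylor series with scaling and squaring."""
    scaled = q * (t / 2 ** squarings)
    result = np.eye(q.shape[0])
    term = np.eye(q.shape[0])
    for k in range(1, terms + 1):
        term = term @ scaled / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


def variant_score(q, wild_type, mutant, t=1.0):
    """Hypothetical score: log-probability of wild_type -> mutant after time t."""
    p = transition_matrix(q, t)
    i = AMINO_ACIDS.index(wild_type)
    j = AMINO_ACIDS.index(mutant)
    return float(np.log(p[i, j]))


# Toy inputs standing in for one MSA column (random, not estimated from data).
rng = np.random.default_rng(0)
s = rng.random((20, 20))
s = (s + s.T) / 2              # symmetric exchangeabilities
pi = rng.dirichlet(np.ones(20))  # stationary amino acid frequencies
q = make_rate_matrix(s, pi)
p = transition_matrix(q, 1.0)
score = variant_score(q, "A", "W", t=1.0)
```

In a site-specific setting, one such matrix would be estimated for each alignment column, so the same substitution can receive different scores at different positions. How the matrices are estimated and how time is chosen are left out of this sketch.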

