
Protein language models learn evolutionary statistics of interacting sequence motifs.

Affiliations

Harvard University, Cambridge, MA 02138.

Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139.

Publication information

Proc Natl Acad Sci U S A. 2024 Nov 5;121(45):e2406285121. doi: 10.1073/pnas.2406285121. Epub 2024 Oct 28.

Abstract

Protein language models (pLMs) have emerged as potent tools for predicting and designing protein structure and function, and the degree to which these models fundamentally understand the inherent biophysics of protein structure stands as an open question. Motivated by a finding that pLM-based structure predictors erroneously predict nonphysical structures for protein isoforms, we investigated the nature of sequence context needed for contact predictions in the pLM Evolutionary Scale Modeling (ESM-2). We demonstrate by use of a "categorical Jacobian" calculation that ESM-2 stores statistics of coevolving residues, analogously to simpler modeling approaches like Markov Random Fields and Multivariate Gaussian models. We further investigated how ESM-2 "stores" information needed to predict contacts by comparing sequence masking strategies, and found that providing local windows of sequence information allowed ESM-2 to best recover predicted contacts. This suggests that pLMs predict contacts by storing motifs of pairwise contacts. Our investigation highlights the limitations of current pLMs and underscores the importance of understanding the underlying mechanisms of these models.
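To make the "categorical Jacobian" idea concrete, the sketch below shows one common formulation: perturb each input position to every possible token, record how the model's output logits shift everywhere else, then reduce the resulting four-dimensional coupling tensor to a contact map with a Frobenius norm and average-product correction (APC), as in coevolution methods. This is a minimal illustration, not the paper's implementation; `logits_fn`, the toy Potts-style test model, and all parameter names are hypothetical stand-ins for an actual pLM such as ESM-2.

```python
import numpy as np

def categorical_jacobian(logits_fn, seq, n_tokens=20):
    """Compute J[i, a, j, b]: the change in the logit for token b at
    position j when position i of the input is substituted with token a.
    `logits_fn` maps an integer sequence (L,) to logits (L, n_tokens);
    in practice this would be a forward pass through a pLM.
    """
    L = len(seq)
    base = logits_fn(seq)                       # (L, n_tokens) wild-type logits
    J = np.zeros((L, n_tokens, L, n_tokens))
    for i in range(L):
        for a in range(n_tokens):
            mutant = seq.copy()
            mutant[i] = a                       # single-position substitution
            J[i, a] = logits_fn(mutant) - base  # logit shifts at all (j, b)
    return J

def contact_map(J):
    """Reduce the Jacobian to a symmetric L x L coupling matrix:
    Frobenius norm over the two token axes, symmetrization, then
    average-product correction, as is standard in coevolution analysis."""
    C = np.linalg.norm(J, axis=(1, 3))          # (L, L) coupling strengths
    C = 0.5 * (C + C.T)
    apc = C.mean(0, keepdims=True) * C.mean(1, keepdims=True) / C.mean()
    return C - apc
```

With a toy pairwise (Potts-like) model standing in for `logits_fn`, the Jacobian recovers the model's pairwise couplings exactly, which is the sense in which the calculation exposes coevolution statistics stored by the network.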


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/85db/11551344/e4c29441c564/pnas.2406285121fig01.jpg
