Nayar Gowri, Tartici Alp, Altman Russ B
Department of Biomedical Data Science, Stanford University, Stanford, California, United States of America.
Department of Genetics, Stanford University, Stanford, California, United States of America.
PLoS Comput Biol. 2025 Sep 12;21(9):e1013424. doi: 10.1371/journal.pcbi.1013424. eCollection 2025 Sep.
Protein Language Models (PLMs) use transformer architectures to capture patterns within protein primary sequences, providing a powerful computational representation of the amino acid sequence. Through large-scale training on protein primary sequences, PLMs generate vector representations that encapsulate the biochemical and structural properties of proteins. At the core of PLMs is the attention mechanism, which captures long-range dependencies by computing pairwise importance scores across residues, thereby highlighting regions of biological interaction within the sequence. The attention matrices offer an untapped opportunity to uncover specific biological properties of proteins, particularly their functions. In this work, we introduce a novel approach, using Evolutionary Scale Modeling (ESM), for identifying High Attention (HA) sites within protein primary sequences, corresponding to key residues that define protein families. By examining attention patterns across multiple layers, we pinpoint residues that contribute most to family classification and function prediction. Our contributions are as follows: (1) we propose a method for identifying HA sites at critical residues from the middle layers of the PLM; (2) we demonstrate that these HA sites provide interpretable links to biological functions; and (3) we show that HA sites improve active-site predictions for functions of unannotated proteins. We make available the HA sites for the human proteome. This work offers a broadly applicable approach to protein classification and functional annotation and provides a biological interpretation of the PLM's representation.
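The abstract describes attention as pairwise importance scores across residues, with high-attention residues flagging biologically important positions. A minimal NumPy sketch of this idea is shown below; the softmax attention computation is standard, but the ranking rule (summing the attention each residue receives over all query positions) is an illustrative assumption, not the paper's exact HA-site criterion, and the toy sequence length and head dimension are arbitrary.

```python
import numpy as np

def attention_matrix(q, k):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d)).

    Row i gives how strongly residue i attends to every other residue.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def high_attention_sites(attn, top_k=3):
    """Illustrative HA-site rule: rank residues by total attention
    received (column sums of the attention matrix)."""
    received = attn.sum(axis=0)
    return np.argsort(received)[::-1][:top_k]

# Toy example: random query/key vectors for a 10-residue sequence.
rng = np.random.default_rng(0)
seq_len, head_dim = 10, 8  # arbitrary illustrative sizes
q = rng.normal(size=(seq_len, head_dim))
k = rng.normal(size=(seq_len, head_dim))

A = attention_matrix(q, k)
sites = high_attention_sites(A)
print(sites)
```

In a real pipeline these attention matrices would come from a pretrained ESM model's middle layers rather than random vectors, and the selection rule would follow the criterion defined in the paper's methods.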