Tartici Alp, Nayar Gowri, Altman Russ B
Department of Genetics, Stanford University, Palo Alto, CA 94304, United States.
Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, United States.
Bioinformatics. 2025 Jun 2;41(6). doi: 10.1093/bioinformatics/btaf330.
Protein language models (PLMs) produce token-level embeddings for each residue, yielding an output matrix whose dimensions vary with sequence length. However, downstream machine learning models typically require fixed-length input vectors, necessitating a pooling method that compresses the output matrix into a single vector representation of the entire protein. Traditional pooling methods often discard substantial information, degrading downstream task performance. We aim to develop a pooling method that produces more expressive general-purpose protein embedding vectors while offering biological interpretability.
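For context, a minimal sketch of the problem (not from the paper): mean pooling, a common baseline, collapses the length-dependent token-embedding matrix into a fixed-length vector with uniform weights. The shapes and variable names below are illustrative.

    import torch

    # Token-level embeddings from a PLM: one row per residue, so the first
    # dimension varies with protein length L, while downstream models need
    # a single fixed-length vector of width D.
    L, D = 350, 1280                      # e.g. a 350-residue protein, ESM2-scale width
    token_embeddings = torch.randn(L, D)  # stand-in for real PLM output

    # Mean pooling: uniform weights over residues. Simple, but it treats
    # functionally critical residues the same as uninformative ones.
    protein_vector = token_embeddings.mean(dim=0)  # shape: (D,)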
We introduce Pool PaRTI, a novel pooling method that leverages internal transformer attention matrices and PageRank to assign token importance weights. Our unsupervised, parameter-free approach consistently prioritizes residues experimentally annotated as critical for function, assigning them higher importance scores. Across four diverse protein machine learning tasks, Pool PaRTI yields significant gains in predictive performance. It also enhances interpretability by identifying biologically relevant regions without relying on explicit structural data or annotated training data. To assess generalizability, we evaluated Pool PaRTI with two encoder-only PLMs, confirming its robustness across different models.
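The abstract does not specify the exact construction, but the core idea can be sketched as follows: read an attention matrix as a directed graph over residues, run PageRank to obtain per-residue importance weights, and pool the token embeddings with those weights. The head/layer aggregation, damping factor, and iteration count below are illustrative assumptions, not the authors' implementation.

    import torch

    def pagerank_weights(attention: torch.Tensor,
                         damping: float = 0.85,
                         n_iter: int = 100) -> torch.Tensor:
        """Power-iteration PageRank over an (L, L) attention matrix.

        attention[i, j] is read as an edge from token i to token j; rows are
        renormalized so each token distributes one unit of 'vote'.
        """
        L = attention.shape[0]
        # Row-normalize attention into a stochastic transition matrix.
        transition = attention / attention.sum(dim=1, keepdim=True).clamp(min=1e-12)
        rank = torch.full((L,), 1.0 / L)
        for _ in range(n_iter):
            rank = (1 - damping) / L + damping * (transition.T @ rank)
        return rank / rank.sum()

    # Illustrative inputs: attention averaged over heads (and possibly layers),
    # plus the matching token embeddings. This aggregation is an assumption.
    L, D = 350, 1280
    attention = torch.rand(L, L)            # stand-in for averaged PLM attention
    token_embeddings = torch.randn(L, D)

    weights = pagerank_weights(attention)        # (L,), sums to 1
    protein_vector = weights @ token_embeddings  # weighted pooling, shape (D,)

In this reading, residues that receive attention from many other well-connected residues accumulate higher PageRank, so the pooled vector emphasizes them rather than averaging all positions equally.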
Pool PaRTI is implemented in Python with PyTorch and is available at github.com/Helix-Research-Lab/Pool_PaRTI.git. The Pool PaRTI sequence embeddings and residue importance values for all human proteins on UniProt are available at zenodo.org/records/15036725 for ESM2 and protBERT.