Pool PaRTI: A PageRank-Based Pooling Method for Identifying Critical Residues and Enhancing Protein Sequence Representations.

Authors

Tartici Alp, Nayar Gowri, Altman Russ B

Affiliation

Stanford University.

Publication information

bioRxiv. 2025 Mar 17:2024.10.04.616701. doi: 10.1101/2024.10.04.616701.

DOI:10.1101/2024.10.04.616701

PMID:40166178

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11956911/

Abstract

MOTIVATION

Protein language models produce token-level embeddings for each residue, resulting in an output matrix with dimensions that vary based on sequence length. However, downstream machine learning models typically require fixed-length input vectors, necessitating a pooling method to compress the output matrix into a single vector representation of the entire protein. Traditional pooling methods often result in substantial information loss, impacting downstream task performance. We aim to develop a pooling method that produces more expressive general-purpose protein embedding vectors while offering biological interpretability.
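To make the fixed-length requirement concrete, the sketch below pools two per-residue embedding matrices of different lengths into vectors of the same size. It assumes that "traditional pooling" refers to common baselines such as mean and max pooling, which the abstract does not name; the random arrays stand in for real language model outputs, and the function names are illustrative.

```python
import numpy as np


def mean_pool(token_embeddings):
    """Average over the residue axis: (length, dim) -> (dim,)."""
    return np.asarray(token_embeddings, dtype=float).mean(axis=0)


def max_pool(token_embeddings):
    """Element-wise maximum over the residue axis: (length, dim) -> (dim,)."""
    return np.asarray(token_embeddings, dtype=float).max(axis=0)


# Placeholder embeddings (assumption: random values, not real model output).
rng = np.random.default_rng(0)
hidden_dim = 8
short_protein = rng.normal(size=(50, hidden_dim))   # 50 residues
long_protein = rng.normal(size=(300, hidden_dim))   # 300 residues

# Both proteins map to vectors of identical shape regardless of length.
print(mean_pool(short_protein).shape, mean_pool(long_protein).shape)
print(max_pool(short_protein).shape, max_pool(long_protein).shape)
```

Both baselines treat every residue identically, which is one way positional importance can be lost when the matrix is compressed.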

RESULTS

We introduce Pool PaRTI, a novel pooling method that leverages internal transformer attention matrices and PageRank to assign token importance weights. Our unsupervised and parameter-free approach consistently prioritizes residues experimentally annotated as critical for function, assigning them higher importance scores. Across four diverse protein machine learning tasks, Pool PaRTI yields significant gains in predictive performance. Additionally, it enhances interpretability by identifying biologically relevant regions without relying on explicit structural data or annotated training. To assess generalizability, we evaluated Pool PaRTI with two encoder-only protein language models, confirming its robustness across different models.
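The general idea of attention-based PageRank pooling can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked under Availability for that). It assumes a single square attention matrix per protein, where entry (i, j) is how strongly residue i attends to residue j; how attention is aggregated across layers and heads is not specified in the abstract. The damping factor of 0.85 is the conventional PageRank default, and all function names are hypothetical.

```python
import numpy as np


def pagerank_weights(attn, damping=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank over a token-to-token attention graph.

    attn[i, j] is read as a weighted edge from token i to token j, so
    tokens that receive a lot of attention accumulate a higher score.
    """
    attn = np.asarray(attn, dtype=float)
    n = attn.shape[0]
    row_sums = attn.sum(axis=1, keepdims=True)
    # Row-normalize into transition probabilities; all-zero rows become uniform.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, attn / safe_sums, 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = damping * (transition.T @ rank) + (1.0 - damping) / n
        converged = np.abs(new_rank - rank).sum() < tol
        rank = new_rank
        if converged:
            break
    return rank / rank.sum()


def pagerank_pool(token_embeddings, attn, damping=0.85):
    """Importance-weighted average of token embeddings: (L, d) -> (d,)."""
    weights = pagerank_weights(attn, damping)
    pooled = weights @ np.asarray(token_embeddings, dtype=float)
    return pooled, weights


# Placeholder inputs (assumption: random values standing in for model output).
rng = np.random.default_rng(0)
length, hidden_dim = 6, 4
logits = rng.normal(size=(length, length))
attention = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
embeddings = rng.normal(size=(length, hidden_dim))

pooled, weights = pagerank_pool(embeddings, attention)
print(pooled.shape)    # fixed-length vector, independent of sequence length
print(weights.sum())   # importance weights form a distribution over residues
```

The weights double as per-residue importance scores, which is the kind of quantity one would compare against experimental annotations of functional residues.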

AVAILABILITY AND IMPLEMENTATION

Pool PaRTI is implemented in Python with PyTorch and is available at https://github.com/Helix-Research-Lab/Pool_PaRTI.git. The Pool PaRTI sequence embeddings and residue importance values for all human proteins on UniProt are available at https://zenodo.org/records/15036725 for ESM2 and protBERT.
