Tartici Alp, Nayar Gowri, Altman Russ B
Department of Genetics, Stanford University, Palo Alto, CA 94304, United States.
Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, United States.
Bioinformatics. 2025 Jun 2;41(6). doi: 10.1093/bioinformatics/btaf330.
Protein language models (PLMs) produce token-level embeddings for each residue, yielding an output matrix whose dimensions vary with sequence length. However, downstream machine learning models typically require fixed-length input vectors, necessitating a pooling method that compresses the output matrix into a single vector representation of the entire protein. Traditional pooling methods often discard substantial information, degrading downstream task performance. We aim to develop a pooling method that produces more expressive general-purpose protein embedding vectors while offering biological interpretability.
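For context, a minimal sketch of the problem (not from the paper): mean pooling, a common baseline, collapses the length-dependent token-embedding matrix into a fixed-length vector with uniform weights. The shapes and variable names below are illustrative.

    import torch

    # Token-level embeddings from a PLM: one row per residue, so the first
    # dimension varies with protein length L, while downstream models need
    # a single fixed-length vector of width D.
    L, D = 350, 1280                      # e.g. a 350-residue protein, ESM2-scale width
    token_embeddings = torch.randn(L, D)  # stand-in for real PLM output

    # Mean pooling: uniform weights over residues. Simple, but it treats
    # functionally critical residues the same as uninformative ones.
    protein_vector = token_embeddings.mean(dim=0)  # shape: (D,)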
We introduce Pool PaRTI, a novel pooling method that leverages internal transformer attention matrices and PageRank to assign token importance weights. Our unsupervised, parameter-free approach consistently prioritizes residues experimentally annotated as critical for function, assigning them higher importance scores. Across four diverse protein machine learning tasks, Pool PaRTI yields significant gains in predictive performance. It also enhances interpretability by identifying biologically relevant regions without relying on explicit structural data or annotated training data. To assess generalizability, we evaluated Pool PaRTI with two encoder-only PLMs, confirming its robustness across different models.
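The abstract does not specify the exact construction, but the core idea can be sketched as follows: read an attention matrix as a directed graph over residues, run PageRank to obtain per-residue importance weights, and pool the token embeddings with those weights. The head/layer aggregation, damping factor, and iteration count below are illustrative assumptions, not the authors' implementation.

    import torch

    def pagerank_weights(attention: torch.Tensor,
                         damping: float = 0.85,
                         n_iter: int = 100) -> torch.Tensor:
        """Power-iteration PageRank over an (L, L) attention matrix.

        attention[i, j] is read as an edge from token i to token j; rows are
        renormalized so each token distributes one unit of 'vote'.
        """
        L = attention.shape[0]
        # Row-normalize attention into a stochastic transition matrix.
        transition = attention / attention.sum(dim=1, keepdim=True).clamp(min=1e-12)
        rank = torch.full((L,), 1.0 / L)
        for _ in range(n_iter):
            rank = (1 - damping) / L + damping * (transition.T @ rank)
        return rank / rank.sum()

    # Illustrative inputs: attention averaged over heads (and possibly layers),
    # plus the matching token embeddings. This aggregation is an assumption.
    L, D = 350, 1280
    attention = torch.rand(L, L)            # stand-in for averaged PLM attention
    token_embeddings = torch.randn(L, D)

    weights = pagerank_weights(attention)        # (L,), sums to 1
    protein_vector = weights @ token_embeddings  # weighted pooling, shape (D,)

In this reading, residues that receive attention from many other well-connected residues accumulate higher PageRank, so the pooled vector emphasizes them rather than averaging all positions equally.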
Pool PaRTI is implemented in Python with PyTorch and is available at github.com/Helix-Research-Lab/Pool_PaRTI.git. The Pool PaRTI sequence embeddings and residue importance values for all human proteins on UniProt are available at zenodo.org/records/15036725 for ESM2 and protBERT.