Functional Annotation of Proteomes Using Protein Language Models: A High-Throughput Implementation of the ProtTrans Model.

Author information

Cases Ildefonso, Martínez-Redondo Gemma, Fernández Rosa, Rojas Ana M

Affiliations

CSIC, Andalusian Center for Developmental Biology, Computational Biology and Bioinformatics Group, Seville, Spain.

CSIC, Institute for Evolutionary Biology, Metazoa Phylogenomics Lab, Barcelona, Spain.

Publication information

Methods Mol Biol. 2025;2941:127-137. doi: 10.1007/978-1-0716-4623-6_8.

DOI:10.1007/978-1-0716-4623-6_8

PMID:40601255

Abstract

Protein function prediction is critical for a wide range of applications in biology, spanning from functional genomics to protein design and genome evolution, among others. However, accurately predicting protein function remains a longstanding challenge in computational biology, especially for non-model organisms. Traditional methods based on sequence similarity often fail to annotate a significant proportion of proteins. The emergence of protein language models has significantly improved this process, enabling more accurate and comprehensive functional annotation. In this work, we highlight how the ProtTrans language model outperforms other tools in per-protein annotation, offering a more precise approach to predicting protein function. We also introduce functional annotation based on embedding space similarity (FANTASIA; available at https://github.com/MetazoaPhylogenomicsLab/FANTASIA), a tool developed to harness these advances for large-scale annotation of uncharacterized proteomes. We provide a detailed overview of how to use FANTASIA, interpret its outputs, and demonstrate its utility in three case studies: (a) enrichment analyses from transcriptomics data, (b) assigning novel functions to unannotated genes in model organisms, and (c) identifying genes involved in important functions in non-model organisms. These results demonstrate the potential of protein language models to advance functional annotation in diverse biological contexts.
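As an illustration of what "annotation based on embedding space similarity" can mean in practice, the sketch below transfers Gene Ontology (GO) terms from the nearest annotated protein to a query protein, using cosine distance between per-protein embedding vectors. It is a toy example of the general idea and is not FANTASIA's code: the function names, the `k` and `max_distance` parameters, the three-dimensional vectors, and the protein identifiers are all assumptions made for illustration. Real embeddings from a protein language model are much higher-dimensional.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def transfer_go_terms(query_embedding, reference, k=1, max_distance=0.5):
    """Collect GO terms from the k closest reference proteins.

    `reference` maps a protein id to (embedding, set of GO terms).
    Returns {go_term: smallest distance at which the term was seen}.
    Both `k` and `max_distance` are hypothetical tuning knobs.
    """
    scored = [
        (cosine_distance(query_embedding, embedding), protein_id)
        for protein_id, (embedding, _terms) in reference.items()
    ]
    scored.sort()  # closest reference proteins first

    transferred = {}
    for distance, protein_id in scored[:k]:
        if distance > max_distance:
            break  # remaining hits are even farther away
        for term in reference[protein_id][1]:
            # First sighting is the closest one because hits are sorted.
            transferred.setdefault(term, distance)
    return transferred


# Toy reference set: made-up ids and vectors, GO ids used only as labels.
REFERENCE = {
    "refA": ([1.0, 0.0, 0.0], {"GO:0003677"}),
    "refB": ([0.0, 1.0, 0.0], {"GO:0016301"}),
}

query = [0.9, 0.1, 0.0]
result = transfer_go_terms(query, REFERENCE)
print(result)
```

In this toy setup the query lies close to `refA`, so only that protein's term is transferred. Keeping the distance alongside each term gives a rough confidence value that could later be used for filtering.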
