Functional Annotation of Proteomes Using Protein Language Models: A High-Throughput Implementation of the ProtTrans Model.

Author information

Cases Ildefonso, Martínez-Redondo Gemma, Fernández Rosa, Rojas Ana M

Affiliations

CSIC, Andalusian Center for Developmental Biology, Computational Biology and Bioinformatics Group, Seville, Spain.

CSIC, Institute for Evolutionary Biology, Metazoa Phylogenomics Lab, Barcelona, Spain.

Publication information

Methods Mol Biol. 2025;2941:127-137. doi: 10.1007/978-1-0716-4623-6_8.

Abstract

Protein function prediction is critical for a wide range of applications in biology, spanning from functional genomics to protein design and genome evolution, among others. However, accurately predicting protein function remains a longstanding challenge in computational biology, especially for non-model organisms. Traditional methods based on sequence similarity often fail to annotate a significant proportion of proteins. The emergence of protein language models has significantly improved this process, enabling more accurate and comprehensive functional annotation. In this work, we highlight how the ProtTrans language model outperforms other tools in per-protein annotation, offering a more precise approach to predicting protein function. We also introduce functional annotation based on embedding space similarity (FANTASIA; available at https://github.com/MetazoaPhylogenomicsLab/FANTASIA), a tool developed to harness these advances for large-scale annotation of uncharacterized proteomes. We provide a detailed overview of how to use FANTASIA, interpret its outputs, and demonstrate its utility in three case studies: (a) enrichment analyses from transcriptomics data, (b) assigning novel functions to unannotated genes in model organisms, and (c) identifying genes involved in important functions in non-model organisms. These results demonstrate the potential of protein language models to advance functional annotation in diverse biological contexts.
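
The general idea of annotation by embedding space similarity can be illustrated with a minimal sketch: each protein is represented as a vector, and a query protein inherits the Gene Ontology terms of its nearest annotated neighbor. This is not FANTASIA's actual implementation. The function names, the protein IDs, the toy three-dimensional vectors (standing in for real per-protein embeddings), the choice of Euclidean distance, and the optional distance cutoff are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def transfer_annotations(query, references, max_distance=None):
    """Return (reference_id, distance, go_terms) for the nearest reference.

    `references` maps a hypothetical protein ID to (embedding, set of GO terms).
    Returns None if there are no references or the nearest one lies
    farther away than `max_distance`.
    """
    best_id, best_dist = None, float("inf")
    for ref_id, (embedding, _terms) in references.items():
        dist = euclidean(query, embedding)
        if dist < best_dist:
            best_id, best_dist = ref_id, dist
    if best_id is None:
        return None
    if max_distance is not None and best_dist > max_distance:
        return None
    return best_id, best_dist, references[best_id][1]

# Toy vectors standing in for real embeddings (assumption: real ones
# would come from a protein language model and be much longer).
references = {
    "ref_kinase": ([0.9, 0.1, 0.0], {"GO:0004672"}),       # protein kinase activity
    "ref_transporter": ([0.0, 0.8, 0.6], {"GO:0005215"}),  # transporter activity
}
query_embedding = [0.8, 0.2, 0.1]

hit = transfer_annotations(query_embedding, references)
print(hit)
```

Here the query lies closest to `ref_kinase`, so it inherits that reference's term. Passing a small `max_distance` leaves the query unannotated rather than forcing a transfer from a distant neighbor, which is one simple way to trade coverage for precision.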
