Improving protein function prediction by learning and integrating representations of protein sequences and function labels.

Authors

Boadu Frimpong, Cheng Jianlin

Affiliation

Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO 65211, United States.

Publication information

Bioinform Adv. 2024 Aug 17;4(1):vbae120. doi: 10.1093/bioadv/vbae120. eCollection 2024.

DOI:10.1093/bioadv/vbae120

PMID:39233898

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11374024/

Abstract

MOTIVATION

Fewer than 1% of proteins have experimentally determined function information, so computational function prediction is critical for obtaining functional information for most proteins, and it remains a major challenge in protein bioinformatics. Despite significant progress by the community over the last decade, the overall accuracy of protein function prediction is still not high, particularly for rare function terms that are associated with few proteins in protein function annotation databases such as UniProt.

RESULTS

We introduce TransFew, a new transformer model that learns representations of both protein sequences and function labels [Gene Ontology (GO) terms] to predict protein function. TransFew leverages a large pre-trained protein language model (ESM2-t48) to learn function-relevant representations of proteins from raw protein sequences. It uses a biological natural language model (BioBert) and a graph convolutional neural network-based autoencoder to generate semantic representations of GO terms from their textual definitions and hierarchical relationships. The two sets of representations are combined via cross-attention to predict protein function. Integrating the protein sequence and label representations not only enhances overall function prediction accuracy but also delivers robust performance in predicting rare function terms with limited annotations by facilitating annotation transfer between GO terms.
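To make the cross-attention step concrete, the following is a minimal sketch in plain Python and is not the TransFew implementation. It assumes that GO-term (label) embeddings act as queries and protein embeddings act as keys and values; the abstract does not specify the direction of attention. The function names (`cross_attention`, `predict_go_scores`), the `readout` vector, the embedding dimension, and the random toy vectors standing in for ESM2 and GO-term representations are all hypothetical.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Each query row attends over all key rows and returns a weighted
    average of the corresponding value rows.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        out = [sum(wi * v[j] for wi, v in zip(w, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def predict_go_scores(label_emb, protein_emb, readout):
    """One independent score in (0, 1) per GO term (multi-label setting).

    Assumption: labels are the queries; protein features are keys and values.
    `readout` is a hypothetical stand-in for a learned output projection.
    """
    attended, weights = cross_attention(label_emb, protein_emb, protein_emb)
    scores = [sigmoid(dot(a, readout)) for a in attended]
    return scores, weights


# Toy data: random vectors stand in for real embeddings (assumption).
random.seed(0)
DIM = 8
protein_emb = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
label_emb = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
readout = [random.gauss(0, 1) for _ in range(DIM)]

scores, weights = predict_go_scores(label_emb, protein_emb, readout)
for i, s in enumerate(scores):
    print(f"GO term {i}: score = {s:.3f}")
```

Because every GO term is scored through the same shared attention and readout parameters, terms whose embeddings lie close together receive similar attended representations. This is one plausible reading of how label representations could support annotation transfer to rare terms; the sketch does not show that TransFew works this way.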

AVAILABILITY AND IMPLEMENTATION

https://github.com/BioinfoMachineLearning/TransFew.
