POSA-GO: Fusion of Hierarchical Gene Ontology and Protein Language Models for Protein Function Prediction.

Author information

Liu Yubao, Wang Benrui, Yan Bocheng, Jiang Haiyue, Dai Yinfei

Affiliations

College of Computer Science and Technology, Changchun University, Changchun 130012, China.

College of Computer Science and Technology, Jilin University, Changchun 130025, China.

Publication information

Int J Mol Sci. 2025 Jul 1;26(13):6362. doi: 10.3390/ijms26136362.

DOI:10.3390/ijms26136362

PMID:40650140

Original link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12250456/

Abstract

Protein function prediction plays a crucial role in uncovering the molecular mechanisms underlying life processes in the post-genomic era. However, with the widespread adoption of high-throughput sequencing technologies, the pace of protein function annotation lags significantly behind that of sequence discovery, highlighting the urgent need for more efficient and reliable predictive methods. Existing methods often ignore the hierarchical structure of Gene Ontology (GO) terms and struggle to dynamically associate protein features with functional contexts. To address these limitations, we propose a novel protein function prediction framework, termed Partial Order-Based Self-Attention for Gene Ontology (POSA-GO), a cross-modal collaborative modelling approach that fuses GO terms with protein sequences. The model uses the pre-trained protein language model ESM-2 to extract deep semantic features from protein sequences. In parallel, it transforms the partial order relationships among GO terms into topological embeddings that capture their biological hierarchical dependencies. A multi-head self-attention mechanism then dynamically models the association weights between proteins and GO terms, enabling context-aware functional annotation. Comparative experiments on the CAFA3 and SwissProt datasets demonstrate that POSA-GO outperforms existing state-of-the-art methods in terms of the Fmax and AUPR metrics, offering a promising solution for protein functional studies.
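For readers unfamiliar with the Fmax metric mentioned in the abstract, the following is a minimal sketch of the protein-centric Fmax as it is commonly defined in CAFA-style evaluations. It is not the authors' evaluation code. The function name `fmax`, the input layout (one set of true GO terms and one term-to-score dictionary per protein), and the threshold grid of 0.01 to 1.00 are assumptions made for illustration.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax (illustrative sketch, not the paper's code).

    y_true:  list of sets, the true GO term ids for each protein.
    y_score: list of dicts mapping GO term id -> predicted score in [0, 1].
    """
    if thresholds is None:
        # Assumed grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    best = 0.0
    for t in thresholds:
        precisions = []  # only proteins with at least one prediction at t
        recalls = []     # all proteins that have true annotations
        for truth, scores in zip(y_true, y_score):
            if not truth:
                continue  # skip proteins without annotations
            predicted = {term for term, s in scores.items() if s >= t}
            tp = len(predicted & truth)
            if predicted:
                precisions.append(tp / len(predicted))
            recalls.append(tp / len(truth))
        if not precisions:
            continue  # no protein has a prediction at this threshold
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


if __name__ == "__main__":
    # Hypothetical toy data; the GO ids are placeholders.
    truth = [{"GO:1", "GO:2"}, {"GO:3"}]
    scores = [
        {"GO:1": 0.9, "GO:2": 0.4, "GO:3": 0.2},
        {"GO:3": 0.8, "GO:1": 0.3},
    ]
    print(fmax(truth, scores))
```

In this sketch, precision is averaged only over proteins that receive at least one prediction at a given threshold, while recall is averaged over all annotated proteins; the reported value is the best F-score across thresholds.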