Zhu Yi-Heng, Zhu Shuxin, Yu Xuan, Yan He, Liu Yan, Xie Xiaojun, Yu Dong-Jun, Ye Rui
College of Artificial Intelligence, Nanjing Agricultural University, 666 Binjiang Avenue, Jiangbei New District, Nanjing, Jiangsu Province, 211800, China.
Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong SAR (HKG), 999077, China.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf420.
Accurately identifying protein functions is essential for understanding life mechanisms and thus advancing drug discovery. Although biochemical experiments are the gold standard for determining protein functions, they are often time-consuming and labor-intensive. Here, we propose a novel composite deep-learning method, Multi-source Knowledge Fusion for Gene Ontology prediction (MKFGO), which infers Gene Ontology (GO) attributes by integrating five complementary pipelines built on multi-source biological data. MKFGO was rigorously benchmarked on 1522 nonredundant proteins and outperformed 12 state-of-the-art function prediction methods. Comprehensive data analyses revealed that the major advantage of MKFGO lies in its two deep-learning components, handcrafted feature representation-based GO prediction (HFRGO) and protein large language model (PLM)-based GO prediction (PLMGO). These derive handcrafted features and PLM-based features, respectively, from protein sequences in different biological views, and their knowledge is fused effectively at the decision level. HFRGO leverages a long short-term memory (LSTM)-attention network embedded with handcrafted features, in which a triplet loss-based guilt-by-association strategy is designed to enhance the correlation between feature similarity and function similarity. PLMGO employs the PLM to capture feature embeddings with discriminative functional patterns from sequences. Three further components, driven by protein-protein interaction, GO term probability, and protein-coding gene sequence, respectively, provide complementary insights that further improve prediction accuracy. The source code and models of MKFGO are freely available at https://github.com/yiheng-zhu/MKFGO.
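The two ideas named in the abstract, a triplet loss that pulls functionally similar proteins together in feature space and decision-level fusion of component outputs, can be illustrated with a minimal sketch. This is not the MKFGO implementation: the Euclidean distance, the margin value, the weighted-average fusion rule, and the function names `triplet_loss` and `fuse_scores` are all assumptions made for illustration, and the abstract does not specify any of them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed form, not taken from the paper).

    The loss is zero once the anchor is closer to the positive (a protein
    with similar function) than to the negative by at least `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def fuse_scores(component_scores, weights):
    """Hypothetical decision-level fusion: weighted average of per-term scores.

    `component_scores` is a list of dicts mapping GO term -> confidence,
    one dict per component; a term missing from a component counts as 0.0.
    """
    total = sum(weights)
    terms = set().union(*(scores.keys() for scores in component_scores))
    return {
        term: sum(w * scores.get(term, 0.0)
                  for scores, w in zip(component_scores, weights)) / total
        for term in terms
    }


if __name__ == "__main__":
    # Toy feature vectors; real inputs would be learned embeddings.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
    # Toy scores from two components with equal (assumed) weights.
    print(fuse_scores([{"GO:0000001": 1.0},
                       {"GO:0000001": 0.0, "GO:0000002": 0.5}], [1.0, 1.0]))
```

In this sketch the fusion weights are fixed by hand; a real system would typically tune them on validation data or learn the combination, and the repository linked above is the authoritative source for how MKFGO actually does it.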