Zhang Lei, Wang Yang, Chen Xiao, Hou Jie, Si Dong, Ding Rui, Jiang Bo, Ledenko Hailey, Cao Renzhi
Department of Computer Science and Technology, Anhui University, Hefei, Anhui 230601, China.
Computer Science Department, Hamilton College, Clinton, NY 13323, United States.
Bioinformatics. 2025 Jun 2;41(6). doi: 10.1093/bioinformatics/btaf267.
With the advancement of deep learning, researchers have increasingly proposed deep-learning-based computational methods for predicting protein function. However, many of these methods treat protein function prediction as a multi-label classification problem and overlook the long-tail distribution of functional labels (i.e., Gene Ontology terms) in datasets. To address this issue, we propose GOBoost, a method built on a long-tail optimization ensemble strategy. GOBoost also introduces a global-local label graph module and a multi-granularity focal loss function to enhance long-tail functional information, mitigate the long-tail phenomenon, and improve overall prediction accuracy.
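To illustrate why a focal-style loss helps with rare labels, the following is a minimal sketch of the standard binary focal loss applied per label in a multi-label setting. It is not GOBoost's multi-granularity focal loss, whose details are not given here. The function name `binary_focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration.

```python
import math

def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over a flat list of per-label predictions.

    probs   : predicted probabilities in [0, 1], one per (protein, GO term) pair
    targets : ground-truth labels, 1 if the term is annotated, else 0
    gamma   : focusing parameter; larger values down-weight easy examples more
    alpha   : weight for positive labels (negatives get 1 - alpha)
    NOTE: generic focal loss for illustration, not the GOBoost loss.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        # Clamp to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        # p_t is the probability assigned to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # The (1 - p_t)^gamma factor shrinks the loss of well-classified labels.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

# A confidently correct prediction contributes far less than a missed one.
easy = binary_focal_loss([0.9], [1])
hard = binary_focal_loss([0.1], [1])
print(easy < hard)
```

With `gamma=0` the modulating factor disappears and the loss reduces to an alpha-weighted binary cross-entropy, which makes the effect of the focusing term easy to isolate.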
We evaluated GOBoost and other state-of-the-art (SOTA) protein function prediction methods on the PDB and AF2 datasets. GOBoost outperformed the SOTA methods on all evaluation metrics for both datasets. Notably, in AUPR on the PDB test set, GOBoost improved on the SOTA HEAL method by 10.71%, 35.91%, and 22.71% for the MF, BP, and CC ontologies, respectively. These results show the necessity and benefit of designing models from the perspective of the long-tail label distribution.
The source code of GOBoost is available at https://github.com/Cao-Labs/GOBoost.