Lin Weining, Miller David, Gu Zhonghui, Orengo Christine
Institute of Structural and Molecular Biology, University College London, London, UK.
Centre for Artificial Intelligence, University College London, London, UK.
Protein Sci. 2025 Jul;34(7):e70182. doi: 10.1002/pro.70182.
Accurate prediction of protein function is fundamental to understanding biological processes, and computational methods are becoming increasingly essential as experimental methods struggle to keep pace with the rate of newly discovered proteins. Despite significant advances in machine learning approaches, existing methods often fail to capture the complex relationships between protein structure, evolution, and function, which limits prediction accuracy. The challenge lies in effectively integrating diverse biological data types while maintaining computational efficiency. Here, we show that GOBeacon, a novel ensemble model integrating structure-aware protein language model embeddings with protein-protein interaction networks, achieves high accuracy in protein function prediction. By employing a contrastive learning framework, GOBeacon demonstrates superior performance on the sequence-based CAFA3 benchmark, achieving F scores of 0.561 for biological process (BP), 0.583 for molecular function (MF), and 0.651 for cellular component (CC), outperforming existing methods including Domain-PFP and DeepGOPlus. The model's effectiveness extends to structure-based function prediction tasks, where it matches or exceeds the performance of specialized structure-based tools such as HEAL and DeepFRI, despite not being explicitly trained on structure. We anticipate that GOBeacon's architecture will serve as a foundation for next-generation protein analysis tools, while its modular design enables future integration of additional data types and improved prediction capabilities. These advances represent a significant step toward reliable automated protein function annotation, addressing a critical bottleneck in modern biology. GOBeacon is now publicly available: https://github.com/wlin16/GOBeacon.git.
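For readers unfamiliar with how F scores are typically computed on CAFA-style benchmarks, the following is a minimal sketch of a protein-centric maximum F-measure. It assumes the reported F scores follow the usual CAFA convention (precision averaged over proteins with at least one prediction above the threshold, recall averaged over all annotated proteins, and the best F over all thresholds); the abstract does not state the exact protocol. The function name `f_max`, the input layout, and the toy GO term identifiers are illustrative assumptions, not part of GOBeacon. Unlike a full CAFA evaluation, the sketch does not propagate terms through the GO hierarchy.

```python
def f_max(predictions, truths, thresholds=None):
    """Protein-centric maximum F-measure (illustrative sketch).

    predictions: dict mapping protein ID -> dict of {GO term: score in [0, 1]}
    truths:      dict mapping protein ID -> set of true GO terms
    """
    if thresholds is None:
        # Assumed threshold grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    # Only proteins with at least one true annotation are evaluated.
    annotated = {p: t for p, t in truths.items() if t}
    if not annotated:
        return 0.0

    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in annotated.items():
            scores = predictions.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= t}
            true_positives = len(predicted & true_terms)
            if predicted:
                # Precision counts only proteins with a prediction at this threshold.
                precisions.append(true_positives / len(predicted))
            # Recall is averaged over every annotated protein.
            recall_sum += true_positives / len(true_terms)

        if not precisions:
            continue  # no predictions at this threshold, precision undefined
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(annotated)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    # Toy data with made-up term identifiers, for illustration only.
    toy_predictions = {
        "P1": {"GO:A": 0.9, "GO:B": 0.4},
        "P2": {"GO:A": 0.2, "GO:C": 0.8},
    }
    toy_truths = {"P1": {"GO:A", "GO:B"}, "P2": {"GO:C"}}
    print(f_max(toy_predictions, toy_truths))
```

In this convention each ontology (BP, MF, CC) would be scored separately by passing only the terms belonging to that ontology.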