Wang Beibei, Cui Boyue, Chen Shiqu, Wang Xuan, Wang Yadong, Li Junyi
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China.
Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China.
Bioinformatics. 2025 May 6;41(5). doi: 10.1093/bioinformatics/btaf285.
In recent years, protein function prediction has moved beyond the bottleneck of sequence-only features: high-precision protein structures predicted by AlphaFold2 have significantly improved prediction accuracy. While single-species protein function prediction methods have achieved remarkable success, multi-species approaches still struggle to integrate multi-source data and to transfer knowledge between distantly related species. Integrating large-scale data and providing effective cross-species label propagation for species with sparse protein annotations remains a critical, unresolved challenge. To address this problem, we propose MSNGO (Multi-species protein Structures and Network to predict GO terms), a model that integrates structural features with network propagation methods. Our validation shows that using structural features can significantly improve the accuracy of multi-species protein function prediction.
We employ graph representation learning techniques to extract amino acid representations from protein structure contact maps and train a structural model using a graph convolution pooling module to derive protein-level structural features. After incorporating the sequence features from ESM-2, we apply a network propagation algorithm to aggregate information and update node representations within a heterogeneous network. The results demonstrate that MSNGO outperforms previous multi-species protein function prediction methods that rely on sequence features and protein-protein networks.
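The two steps described above (pooling residue-level graph features into a protein-level vector, then propagating information over a network) can be illustrated with a minimal NumPy sketch. This is not the MSNGO implementation: the 8 Å contact threshold, the single untrained graph convolution layer, mean pooling, the random stand-in for an ESM-2 embedding, the restart-style propagation rule, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def contact_map(coords, threshold=8.0):
    """Binary residue contact map from (L, 3) coordinates.
    The 8.0 angstrom threshold is an assumed value, not taken from the paper."""
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (dists < threshold).astype(float)  # diagonal gives self-loops

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes stay zero."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    return adj * inv_sqrt[:, None] * inv_sqrt[None, :]

def gcn_layer(adj_norm, feats, weight):
    """One graph convolution step followed by ReLU."""
    return np.maximum(adj_norm @ feats @ weight, 0.0)

def protein_embedding(coords, residue_feats, weight):
    """Residue features -> graph convolution -> mean pooling over residues."""
    adj_norm = normalize_adjacency(contact_map(coords))
    return gcn_layer(adj_norm, residue_feats, weight).mean(axis=0)

def propagate(network, init, alpha=0.5, steps=20):
    """Generic propagation with restart: mix neighbor information with the
    initial node representations at every step (an assumed update rule)."""
    net_norm = normalize_adjacency(network)
    init = np.asarray(init, dtype=float)
    state = init.copy()
    for _ in range(steps):
        state = alpha * (net_norm @ state) + (1 - alpha) * init
    return state

# Toy usage: 3 proteins with 30 residues each and random inputs.
rng = np.random.default_rng(0)
weight = rng.normal(size=(16, 8))           # untrained layer weights
structural = np.stack([
    protein_embedding(rng.normal(scale=10.0, size=(30, 3)),
                      rng.normal(size=(30, 16)), weight)
    for _ in range(3)
])
sequence = rng.normal(size=(3, 4))          # stand-in for ESM-2 features
nodes = np.concatenate([structural, sequence], axis=1)
network = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
updated = propagate(network, nodes)         # shape (3, 12)
```

In this sketch, proteins without their own signal still receive information from their network neighbors, which is the intuition behind propagating labels toward sparsely annotated species.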