School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China.
General Hospital of Heilongjiang Province Land Reclamation Bureau, Harbin 150086, China.
Brief Bioinform. 2023 Sep 22;24(6). doi: 10.1093/bib/bbab556.
Protein function annotation is one of the most important research topics for revealing the essence of life at the molecular level in the post-genome era. Current research shows that integrating multisource data can effectively improve the performance of protein function prediction models. However, existing methods rely heavily on complex feature engineering and model integration, which limits their further development. In addition, deep learning-based models extract sequence features only from the labeled data of a given dataset and therefore ignore the large amount of available unlabeled sequence data. Here, we propose an end-to-end protein function annotation model named HNetGO, which uses a heterogeneous network to integrate protein sequence similarity and protein-protein interaction network information, and combines it with a pretrained model to extract semantic features of protein sequences. We also design an attention-based graph neural network model that can effectively extract node-level features from heterogeneous networks and predicts protein function by measuring the similarity between protein nodes and gene ontology term nodes. Comparative experiments on the human dataset show that HNetGO achieves state-of-the-art performance on the cellular component and molecular function branches.
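The final prediction step described above, scoring a protein against gene ontology terms by the similarity of their node representations, can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, so the dot product followed by a sigmoid used here is an assumption, as are the function names and the toy three-dimensional vectors, which stand in for learned node embeddings and are not taken from HNetGO.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def score_go_terms(protein_vec, term_vecs):
    """Map each GO term id to a score in (0, 1).

    Assumption: similarity is a dot product squashed by a sigmoid;
    the source abstract does not state which measure is used.
    """
    return {
        term: 1.0 / (1.0 + math.exp(-dot(protein_vec, vec)))
        for term, vec in term_vecs.items()
    }

# Toy embeddings, made up for illustration only (not learned values).
protein_vec = [0.5, -1.0, 2.0]
term_vecs = {
    "GO:0005634": [1.0, 0.0, 1.0],
    "GO:0003677": [0.0, 1.0, 0.0],
    "GO:0005737": [-1.0, 0.5, -0.5],
}

scores = score_go_terms(protein_vec, term_vecs)
# Rank GO terms from most to least similar to the protein.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In a sketch like this, terms whose scores exceed a chosen threshold would be reported as predicted annotations for the protein; the threshold itself would have to be tuned on validation data.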