School of Data Science, Fudan University, 220 Handan Rd., Shanghai, 200433, China.
Comput Biol Med. 2024 Aug;178:108768. doi: 10.1016/j.compbiomed.2024.108768. Epub 2024 Jun 26.
Biomedical knowledge graphs (KGs) serve as comprehensive data repositories that contain rich information about nodes and edges, making it possible to model complex relationships among biological entities. Many approaches either learn node features through traditional machine learning methods or use graph neural networks (GNNs) to learn features of target nodes in biomedical KGs directly and apply them to downstream tasks. Motivated by pre-training techniques in natural language processing (NLP), we propose a framework named PT-KGNN (Pre-Training the biomedical KG with GNNs), which learns node embeddings in a broader context by applying GNNs to the biomedical KG. We design several experiments to evaluate the effectiveness of the proposed framework and the impact of KG scale. Performance on the downstream tasks improves consistently as the scale of the biomedical KG used for pre-training increases. Pre-training on large-scale biomedical KGs significantly enhances drug-drug interaction (DDI) and drug-disease association (DDA) prediction performance on the independent dataset, and embeddings derived from a larger biomedical KG outperform those obtained from a smaller KG. Applying pre-training techniques to biomedical KGs allows rich semantic and structural information to be learned, which leads to better performance on downstream tasks. It is evident that pre-training techniques hold tremendous potential and have wide-ranging applications in bioinformatics.
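The two-stage idea in the abstract (pre-train node embeddings on the KG, then reuse them for a downstream task such as DDI prediction) can be illustrated with a minimal sketch. This is not the PT-KGNN implementation: the abstract does not specify the GNN architecture or the pre-training objective, so the sketch assumes a single mean-aggregation layer and a link-prediction objective, and the toy graph, function names, and hyperparameters are all hypothetical.

```python
import itertools
import numpy as np

# Hypothetical toy KG: nodes 0-2 and 3-5 form two triangles joined by edge (2, 3).
TOY_EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

def mean_aggregation_matrix(num_nodes, edges):
    """Row-normalized adjacency with self-loops: each node averages itself and its neighbors."""
    A = np.eye(num_nodes)
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    return A / A.sum(axis=1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain(num_nodes, edges, dim=8, epochs=300, lr=0.5, seed=0):
    """Stage 1 (assumed objective): learn embeddings by predicting which node pairs are KG edges."""
    rng = np.random.default_rng(seed)
    A = mean_aggregation_matrix(num_nodes, edges)
    E = rng.normal(scale=0.1, size=(num_nodes, dim))  # trainable input embeddings

    # Every unordered node pair is a training example: label 1 for KG edges, 0 otherwise.
    edge_set = {(min(u, v), max(u, v)) for u, v in edges}
    pair_list = list(itertools.combinations(range(num_nodes), 2))
    labels = np.array([1.0 if p in edge_set else 0.0 for p in pair_list])
    pairs = np.array(pair_list)
    src, dst = pairs[:, 0], pairs[:, 1]

    for _ in range(epochs):
        H = A @ E                                       # one round of neighbor aggregation
        p = sigmoid(np.sum(H[src] * H[dst], axis=1))    # predicted edge probabilities
        g = ((p - labels) / len(labels))[:, None]       # d(mean BCE) / d(score)
        grad_H = np.zeros_like(H)
        np.add.at(grad_H, src, g * H[dst])              # accumulate gradients per node
        np.add.at(grad_H, dst, g * H[src])
        E -= lr * (A.T @ grad_H)                        # backpropagate through the aggregation
    return A @ E                                        # pre-trained node embeddings

def link_probability(H, u, v):
    """Stage 2 (illustrative): score a candidate pair, e.g. a drug-drug pair, from frozen embeddings."""
    return float(sigmoid(H[u] @ H[v]))
```

Calling `pretrain(6, TOY_EDGES)` returns a 6-by-8 embedding matrix, and `link_probability` then scores any node pair from it. In a realistic setting the frozen embeddings would more likely be fed to a separate classifier trained on labeled DDI or DDA pairs; the dot-product score here only keeps the sketch short.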