Kumar Praveen, Metzger Vincent T, Purushotham Swastika T, Kedia Priyansh, Bologa Cristian G, Lambert Christophe G, Yang Jeremy J
University of New Mexico (UNM), School of Medicine, Department of Internal Medicine, Translational Informatics Division, Albuquerque, New Mexico, USA.
medRxiv. 2025 Mar 17:2025.03.17.25323906. doi: 10.1101/2025.03.17.25323906.
Biomedical knowledge graphs (KGs), such as the Data Distillery Knowledge Graph (DDKG), capture known relationships among entities (e.g., genes, diseases, proteins) and provide valuable insights for research. However, these relationships are typically derived from prior studies, so associations that have not yet been reported remain unexplored. Identifying such unknown associations, including previously unknown disease-associated genes, remains a critical challenge in bioinformatics and is crucial for advancing biomedical knowledge. Traditional methods, such as linkage analysis and genome-wide association studies (GWAS), can be time-consuming and resource-intensive, which highlights the need for efficient computational approaches that predict new disease genes from known disease-gene associations. Recently, network-based methods and KGs, enhanced by advances in machine learning (ML) frameworks, have emerged as promising tools for inferring these unexplored associations. Given the technical limitations of the Neo4j Graph Data Science (GDS) machine learning pipeline, we developed a novel ML pipeline called KG2ML (Knowledge Graph to Machine Learning). This pipeline uses our Positive and Unlabeled (PU) learning algorithm, PULSNAR (Positive Unlabeled Learning Selected Not At Random), and incorporates path-based feature extraction from ProteinGraphML.
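To make the idea of path-based feature extraction concrete, the following is a minimal sketch and not the ProteinGraphML implementation, whose details are not given here. It assumes a toy undirected graph in which each candidate gene is described by counts of one-hop and two-hop paths to known disease-associated genes, keyed by the sequence of edge types along the path. All node names, edge types, and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy knowledge graph as (source, edge_type, target) triples.
EDGES = [
    ("GENE_A", "interacts_with", "GENE_B"),
    ("GENE_A", "in_pathway", "PATHWAY_1"),
    ("GENE_C", "in_pathway", "PATHWAY_1"),
    ("GENE_D", "in_pathway", "PATHWAY_2"),
]

# Hypothetical genes already associated with the disease of interest.
KNOWN_DISEASE_GENES = {"GENE_B", "GENE_C"}


def build_adjacency(edges):
    """Index the triples by node, treating every edge as undirected."""
    adjacency = defaultdict(list)
    for source, edge_type, target in edges:
        adjacency[source].append((edge_type, target))
        adjacency[target].append((edge_type, source))
    return adjacency


def path_features(adjacency, gene, known_genes):
    """Count 1-hop and 2-hop paths from `gene` to any known disease gene.

    Each feature key is the tuple of edge types along the path, so paths
    that follow the same pattern are counted together.
    """
    counts = Counter()
    for type1, mid in adjacency.get(gene, []):
        if mid in known_genes:
            counts[(type1,)] += 1
        for type2, end in adjacency.get(mid, []):
            # Skip paths that simply return to the starting gene.
            if end != gene and end in known_genes:
                counts[(type1, type2)] += 1
    return dict(counts)


ADJACENCY = build_adjacency(EDGES)
FEATURES_A = path_features(ADJACENCY, "GENE_A", KNOWN_DISEASE_GENES)
FEATURES_D = path_features(ADJACENCY, "GENE_D", KNOWN_DISEASE_GENES)
```

In this toy graph, GENE_A reaches one known gene directly and another through a shared pathway, while GENE_D reaches none, so its feature dictionary is empty. Feature vectors of this kind could then be passed to a classifier.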
KG2ML was applied to 12 diseases, including Bipolar Disorder, Coronary Artery Disease, and Parkinson's Disease, to infer disease-associated genes not explicitly recorded in DDKG. For several of these diseases, 14 of the 15 top-ranked genes lacked prior explicit associations in the DDKG but were supported by literature and TINX (Target Importance and Novelty Explorer) evidence. Incorporating PULSNAR-imputed genes as positives improved XGBoost classification performance, demonstrating the potential of PU learning for identifying hidden gene-disease relationships.
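The step of treating imputed genes as positives can be illustrated with a deliberately simplified sketch. This is not PULSNAR, whose internals are not described here. It assumes that some classifier has already scored each unlabeled gene and that the fraction of true positives among the unlabeled genes (called `alpha` below) has been estimated by some other means. The scores, gene names, and `alpha` value are hypothetical.

```python
def impute_positives(scores, alpha):
    """Return the top-scoring unlabeled genes to relabel as positive.

    scores: dict mapping gene ID to a classifier score (higher means
            more similar to known positives).
    alpha:  assumed fraction of unlabeled genes that are truly positive.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    n_flip = round(alpha * len(scores))
    # Sort by descending score; break ties by gene ID for determinism.
    ranked = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return ranked[:n_flip]


# Hypothetical classifier scores for genes with no recorded association.
UNLABELED_SCORES = {
    "GENE_A": 0.91,
    "GENE_D": 0.12,
    "GENE_E": 0.67,
    "GENE_F": 0.33,
    "GENE_G": 0.05,
}
KNOWN_POSITIVES = ["GENE_B", "GENE_C"]
ALPHA = 0.4  # hypothetical estimate, not a value from the study

IMPUTED = impute_positives(UNLABELED_SCORES, ALPHA)

# Labels for retraining: known and imputed positives are 1, the rest 0.
LABELS = {gene: 1 for gene in KNOWN_POSITIVES + IMPUTED}
LABELS.update({gene: 0 for gene in UNLABELED_SCORES if gene not in IMPUTED})
```

A downstream classifier such as XGBoost would then be retrained on `LABELS` and compared against a model trained with every unlabeled gene treated as negative.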
The improvement in classification performance after including PULSNAR-imputed genes as positive examples, together with evaluations by subject matter experts (SMEs) of the top 15 imputed genes for each of the 12 diseases, suggests that PU learning can effectively uncover disease-gene associations missing from existing KGs. By integrating KG data with ML-based inference, our KG2ML pipeline provides a scalable and interpretable framework to advance biomedical research while addressing the inherent limitations of current KGs.