KG2ML: Integrating Knowledge Graphs and Positive Unlabeled Learning for Identifying Disease-Associated Genes.

Authors

Kumar Praveen, Metzger Vincent T, Purushotham Swastika T, Kedia Priyansh, Bologa Cristian G, Lambert Christophe G, Yang Jeremy J

Affiliation

University of New Mexico (UNM), School of Medicine, Department of Internal Medicine, Translational Informatics Division, Albuquerque, New Mexico, USA.

Publication

medRxiv. 2025 Mar 17:2025.03.17.25323906. doi: 10.1101/2025.03.17.25323906.

DOI:10.1101/2025.03.17.25323906
PMID:40166563
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11957101/
Abstract

BACKGROUND

Biomedical knowledge graphs (KGs), such as the Data Distillery Knowledge Graph (DDKG), capture known relationships among entities (e.g., genes, diseases, proteins), providing valuable insights for research. However, these relationships are typically derived from prior studies, leaving potential unknown associations unexplored. Identifying such unknown associations, including previously unknown disease-associated genes, remains a critical challenge in bioinformatics and is crucial for advancing biomedical knowledge. Traditional methods, such as linkage analysis and genome-wide association studies (GWAS), can be time-consuming and resource-intensive. This highlights the need for efficient computational approaches to identify or predict new genes using known disease-gene associations. Recently, network-based methods and KGs, enhanced by advances in machine learning (ML) frameworks, have emerged as promising tools for inferring these unexplored associations. Given the technical limitations of the Neo4j Graph Data Science (GDS) machine learning pipeline, we developed a novel machine learning pipeline called KG2ML (Knowledge Graph to Machine Learning). This pipeline utilizes our Positive and Unlabeled (PU) learning algorithm, PULSNAR (Positive Unlabeled Learning Selected Not At Random), and incorporates path-based feature extraction from ProteinGraphML.
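
To make "path-based feature extraction" concrete, the following is a minimal sketch, not the ProteinGraphML implementation. It assumes a toy graph of (source, relation, target) triples and counts two-hop paths from a gene to a disease, keyed by the pair of relation types traversed. All node names, relation names, and the helper functions `build_adjacency` and `path_features` are hypothetical.

```python
from collections import defaultdict

# Toy knowledge graph as (source, relation, target) triples.
# Every node and relation name here is invented for illustration.
TRIPLES = [
    ("GENE_A", "interacts_with", "GENE_B"),
    ("GENE_A", "participates_in", "PATHWAY_1"),
    ("GENE_B", "associated_with", "DISEASE_X"),
    ("GENE_C", "participates_in", "PATHWAY_1"),
    ("GENE_C", "associated_with", "DISEASE_X"),
    ("GENE_D", "participates_in", "PATHWAY_2"),
]


def build_adjacency(triples):
    """Index triples as an undirected map: node -> [(relation, neighbor)]."""
    adj = defaultdict(list)
    for src, rel, dst in triples:
        adj[src].append((rel, dst))
        adj[dst].append((rel, src))
    return adj


def path_features(adj, gene, disease):
    """Count two-hop paths gene -> X -> disease, keyed by (relation1, relation2)."""
    counts = defaultdict(int)
    for rel1, mid in adj[gene]:
        if mid == disease:
            # Skip direct edges: only indirect evidence is used as a feature.
            continue
        for rel2, end in adj[mid]:
            if end == disease:
                counts[(rel1, rel2)] += 1
    return dict(counts)


adj = build_adjacency(TRIPLES)
features_a = path_features(adj, "GENE_A", "DISEASE_X")
features_d = path_features(adj, "GENE_D", "DISEASE_X")
print(features_a)
print(features_d)
```

Each distinct relation pair becomes one column of a feature vector for the gene, so a gene with no recorded disease edge can still be scored from its indirect connections.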

RESULTS

KG2ML was applied to 12 diseases, including Bipolar Disorder, Coronary Artery Disease, and Parkinson's Disease, to infer disease-associated genes not explicitly recorded in DDKG. For several of these diseases, 14 out of the 15 top-ranked genes lacked prior explicit associations in the DDKG but were supported by literature and TINX (Target Importance and Novelty Explorer) evidence. Incorporating PULSNAR-imputed genes as positives enhanced XGBoost classification, demonstrating the potential of PU learning in identifying hidden gene-disease relationships.
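
The imputation step described above can be illustrated with a deliberately naive sketch. It is not PULSNAR: the scores are assumed to come from some already-trained classifier, and `alpha`, the fraction of hidden positives among the unlabeled genes, is taken as a given input rather than estimated. The function name `impute_positives` and the gene identifiers are hypothetical.

```python
def impute_positives(scores, labeled_positive, alpha):
    """Return the top-scoring unlabeled items to relabel as positive.

    scores: dict mapping item -> classifier score (higher = more positive-like)
    labeled_positive: set of items already known to be positive
    alpha: assumed fraction of positives among the unlabeled items
    """
    unlabeled = [item for item in scores if item not in labeled_positive]
    k = round(alpha * len(unlabeled))  # number of unlabeled items to relabel
    ranked = sorted(unlabeled, key=lambda item: scores[item], reverse=True)
    return ranked[:k]


# Hypothetical scores for six genes; G1 is the only known positive.
scores = {"G1": 0.95, "G2": 0.90, "G3": 0.80, "G4": 0.40, "G5": 0.10, "G6": 0.05}
known = {"G1"}

imputed = impute_positives(scores, known, alpha=0.4)
augmented_positives = known | set(imputed)
print(imputed)
print(sorted(augmented_positives))
```

A downstream classifier would then be retrained with `augmented_positives` as the positive class, which is the kind of comparison the reported XGBoost improvement refers to.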

CONCLUSION

The improvement in classification performance after including PULSNAR-imputed genes as positive examples, together with evaluations by subject matter experts (SMEs) of the top 15 imputed genes for 12 diseases, suggests that PU learning can effectively uncover disease-gene associations missing from existing knowledge graphs (KGs). By integrating KG data with ML-based inference, our KG2ML pipeline provides a scalable and interpretable framework to advance biomedical research while addressing the inherent limitations of current KGs.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/345a/11957101/50805935a8da/nihpp-2025.03.17.25323906v1-f0004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/345a/11957101/7f1d8353f1e2/nihpp-2025.03.17.25323906v1-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/345a/11957101/dfcda42c9872/nihpp-2025.03.17.25323906v1-f0002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/345a/11957101/257fa6aca087/nihpp-2025.03.17.25323906v1-f0003.jpg
