Rajagopalan Ananya, Nguyen Tram Anh, Guare Lindsay A, Garao Rico Andre Luis, Venkatesh Rasika, Caruth Lannawill, Verma Anurag, Ritchie Marylyn D, Hall Molly A, Romano Joseph D, Setia-Verma Shefali
Genomics and Computational Biology Graduate Program.
Department of Genetics.
medRxiv. 2025 Aug 21:2025.08.19.25333942. doi: 10.1101/2025.08.19.25333942.
Multi-omics data are instrumental in obtaining a comprehensive picture of complex biological systems. This is particularly useful for women's health conditions such as endometriosis, which has been historically understudied despite its high prevalence (around 10% of women of reproductive age). Consequently, endometriosis has limited genetic characterization: current genome-wide association studies explain only 11% of its 47% total estimated heritability. Graph representations provide an intuitive and meaningful way to relate concepts across diverse data sources and to address the fundamental sparsity and dimensionality challenges of multi-omics data analysis. Here we present DRIVE-KG (Disease Risk Inference and Variant Exploration-Knowledge Graph), which uses a heterogeneous graph representation to integrate biological data from multi-omics datasets: dbSNP, NCBI Human Gene, OmicsPred, GTEx, and Open Targets. We drew directly from the knowledge captured in these data, using nodes to represent genes, single nucleotide polymorphisms (SNPs), proteins, and phenotypes, and edges to represent relationships between these concepts. We trained two models using DRIVE-KG: a link prediction model to suggest associations between SNPs and two pilot phenotypes (endometriosis and obesity), and a graph convolutional network (GCN) to classify patient-level endometriosis status. We conducted the patient-level classification using data from 1,441 Penn Medicine BioBank participants with gold-standard, chart-reviewed endometriosis status. The link prediction model uncovered 66 high-confidence (score ≥ 0.95), previously unreported SNP-endometriosis associations. Many of these variants were linked to obesity/body mass index traits (24.2%), lipid metabolism (6%), and depressive disorders (4.5%), in agreement with emerging hypotheses about endometriosis etiology.
In contrast, 11% of the 149 high-confidence candidate SNP-obesity associations (score ≥ 0.9888) were in linkage disequilibrium (LD) with known obesity associations. The GCN for classifying patient endometriosis status achieved an area under the precision-recall curve (AUPRC) of 0.738, compared to 0.679 for a genetic risk score. Despite this moderate improvement, we found that the GCN learned a meaningful stratification of underlying adenomyosis signal and severe grades of endometriosis. We have demonstrated that heterogeneous integration of multi-omics data is valuable for diverse downstream tasks, including discovery and clinical prediction, particularly for understudied diseases where traditional genomic approaches are insufficient.
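For readers unfamiliar with the AUPRC metric reported above, the following is a minimal sketch of one common estimator, average precision, written in plain Python. It is not the authors' evaluation code: the function name `average_precision` and the toy labels and scores are hypothetical, and the sketch assumes binary labels and no tied scores.

```python
# Minimal sketch (not the authors' code): average precision as an
# estimator of the area under the precision-recall curve (AUPRC).
# Assumptions: labels are 0/1, and no two scores are tied.

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true positive."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank samples from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision among the top `rank` predictions.
            precision_sum += true_positives / rank
    return precision_sum / n_pos


# Hypothetical toy data: 1 = case, 0 = control.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auprc = average_precision(toy_labels, toy_scores)
print(f"toy AUPRC: {toy_auprc:.3f}")
```

Unlike the area under the ROC curve, the AUPRC of an uninformative classifier is approximately the fraction of positive samples, so values such as 0.738 and 0.679 are best interpreted relative to the case prevalence in the evaluation cohort.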