Division of Pre-Clinical Innovation, National Center for Advancing Translational Sciences (NCATS), National Institutes of Health (NIH), Rockville, MD, United States.
Chief Technology Office, Booz Allen Hamilton, Bethesda, MD, United States.
J Am Med Inform Assoc. 2023 Dec 22;31(1):154-164. doi: 10.1093/jamia/ocad186.
Identifying sets of rare diseases with shared aspects of etiology and pathophysiology may enable drug repurposing. Toward that aim, we utilized an integrative knowledge graph to construct clusters of rare diseases.
Data on 3242 rare diseases were extracted from the National Center for Advancing Translational Sciences Genetic and Rare Diseases Information Center internal data resources. The rare disease data were enriched with additional biomedical data, including gene and phenotype ontologies, biological pathway data, and small molecule-target activity data, to create a knowledge graph (KG). Node embeddings were trained and clustered. We validated the disease clusters through semantic similarity and feature enrichment analysis.
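As an illustration of the clustering step only, the following is a minimal sketch that groups disease embedding vectors with k-means. The abstract does not name the embedding or clustering algorithm, so k-means is an assumption here, and the disease names and 2-D vectors are made-up toy values rather than data from the study.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(embeddings, k, iterations=50, seed=0):
    """Cluster a {name: vector} mapping into k groups; returns {name: cluster_id}."""
    names = sorted(embeddings)
    rng = random.Random(seed)
    # Initialise centroids from k distinct, randomly chosen embeddings.
    centroids = [list(embeddings[n]) for n in rng.sample(names, k)]
    assignment = {}
    for _ in range(iterations):
        # Assign every node to its nearest centroid.
        new_assignment = {
            n: min(range(k), key=lambda c: squared_distance(embeddings[n], centroids[c]))
            for n in names
        }
        if new_assignment == assignment:
            break  # converged: no node changed cluster
        assignment = new_assignment
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [embeddings[n] for n in names if assignment[n] == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment


# Hypothetical toy embeddings (not from the paper): two well-separated groups.
toy_embeddings = {
    "disease_a1": [0.0, 0.0],
    "disease_a2": [0.0, 1.0],
    "disease_b1": [10.0, 10.0],
    "disease_b2": [10.0, 11.0],
}
toy_clusters = kmeans(toy_embeddings, k=2)
```

In practice the vectors would come from a node embedding model trained on the KG, and the number of clusters would be chosen by a separate model selection step.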
Thirty-seven disease clusters were created with a mean size of 87 diseases. We validated the clusters quantitatively via semantic similarity based on the Orphanet Rare Disease Ontology. In addition, the clusters were analyzed for enrichment of associated genes, revealing that the enriched genes within clusters are highly related.
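One common way to test whether a gene is enriched in a cluster is a one-sided hypergeometric test. The abstract does not state which enrichment statistic was used, so the sketch below is an assumption, and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, cluster_size, annotated_in_cluster):
    """Upper-tail hypergeometric p-value.

    Probability of seeing at least `annotated_in_cluster` gene-associated
    diseases in a cluster of `cluster_size` diseases drawn at random from
    `population` diseases, of which `annotated` carry the gene association.
    """
    upper = min(annotated, cluster_size)
    total = comb(population, cluster_size)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, cluster_size - i)
        for i in range(annotated_in_cluster, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 diseases overall, 3 linked to the gene,
# and a cluster of 3 diseases that contains all 3 linked ones.
p_all_hits = enrichment_p_value(10, 3, 3, 3)  # equals 1/120
p_no_hits = enrichment_p_value(10, 3, 3, 0)   # equals 1.0 (tail covers every outcome)
```

When many genes and clusters are tested, the resulting p-values would normally need a multiple-testing correction before being interpreted.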
We demonstrate that node embeddings are an effective method for clustering diseases within a heterogeneous KG. Semantically similar diseases and relevant enriched genes have been uncovered within the clusters. Connections between disease clusters and drugs are enumerated for follow-up efforts.
We lay out a method for clustering rare diseases using graph node embeddings. We developed an easy-to-maintain pipeline that can be updated when new data on rare diseases emerge. The embeddings themselves can be paired with representation learning methods for other data types, such as drugs, to address other predictive modeling problems.