Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States of America.
Division of Clinical Pharmacology, Vanderbilt University Medical Center, Nashville, TN, United States of America.
PLoS One. 2019 Feb 13;14(2):e0212112. doi: 10.1371/journal.pone.0212112. eCollection 2019.
Genome-wide and phenome-wide association studies are commonly used to identify important relationships between genetic variants and phenotypes. Most studies have treated diseases as independent variables and have suffered from the burden of multiple-testing adjustment because of the large number of genetic variants and disease phenotypes. In this study, we used topic modeling via non-negative matrix factorization (NMF) to identify associations between disease phenotypes and genetic variants. Topic modeling is an unsupervised machine learning approach that can be used to learn patterns from electronic health record (EHR) data. We chose the single nucleotide polymorphism (SNP) rs10455872 in LPA as the predictor since it has been shown to be associated with increased risk of hyperlipidemia and cardiovascular diseases (CVD). Using data from 12,759 individuals with EHRs and linked DNA samples at Vanderbilt University Medical Center, we trained an NMF topic model on 1,853 distinct phenotypes and identified six topics. We then tested the associations of these topics with rs10455872 in LPA. Topics enriched for CVD and hyperlipidemia had positive correlations with rs10455872 (P < 0.001), replicating a previous finding. We also identified a negative correlation between rs10455872 in LPA and a topic enriched for lung cancer (P < 0.001), which had not previously been identified via phenome-wide scanning. We were able to replicate the top finding in a separate dataset. Our results demonstrate the applicability of topic modeling in exploring the relationship between genetic variants and clinical diseases.
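The workflow described in the abstract (factor an individual-by-phenotype matrix into a small number of topics, then relate each individual's topic weights to a genotype) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a binary phenotype indicator matrix, uses plain Lee-Seung multiplicative updates as a stand-in NMF solver, and uses a Pearson correlation as a stand-in association test. The toy data, matrix sizes, allele frequency, and all function names are hypothetical.

```python
import numpy as np


def make_toy_data(n_people=200, n_phenotypes=30, seed=1):
    """Synthetic stand-in for EHR data (all values here are made up)."""
    rng = np.random.default_rng(seed)
    # Risk-allele count per person (0, 1, or 2); frequency 0.2 is arbitrary.
    genotype = rng.binomial(2, 0.2, size=n_people).astype(float)
    # Baseline probability of any phenotype being recorded.
    prob = np.full((n_people, n_phenotypes), 0.1)
    # Carriers are more likely to have the first five phenotypes.
    prob[:, :5] += 0.3 * (genotype > 0)[:, None]
    X = (rng.random((n_people, n_phenotypes)) < prob).astype(float)
    return X, genotype


def nmf_multiplicative(X, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate X (people x phenotypes) as W @ H with W, H >= 0.

    W holds per-person topic weights; H holds per-topic phenotype loadings.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, n_topics)) + eps
    H = rng.random((n_topics, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def topic_genotype_correlation(W, genotype, eps=1e-12):
    """Pearson correlation between each topic's weights and the allele count."""
    g = genotype - genotype.mean()
    Wc = W - W.mean(axis=0)
    denom = np.sqrt((Wc ** 2).sum(axis=0) * (g ** 2).sum()) + eps
    return (Wc.T @ g) / denom


X, genotype = make_toy_data()
W, H = nmf_multiplicative(X, n_topics=6)  # six topics, as in the abstract
r = topic_genotype_correlation(W, genotype)
for k, r_k in enumerate(r):
    top = np.argsort(H[k])[::-1][:3]  # phenotypes that load most on topic k
    print(f"topic {k}: r = {r_k:+.3f}, top phenotype columns = {top.tolist()}")
```

Testing a handful of topics against one variant, rather than every phenotype separately, is what reduces the multiple-testing burden the abstract refers to. A real analysis would presumably also adjust for covariates and compute P values, which this sketch omits.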