Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA.
Fixed Income Division, Morgan Stanley & Co LLC, New York, NY, USA.
J Biomed Inform. 2019 Oct;98:103270. doi: 10.1016/j.jbi.2019.103270. Epub 2019 Aug 22.
Discovering subphenotypes of complex diseases can help characterize disease cohorts for investigative studies aimed at developing better diagnoses and treatments. Recent advances in unsupervised machine learning on electronic health record (EHR) data have enabled researchers to discover phenotypes without input from domain experts. However, most existing studies have ignored time and modeled diseases as discrete events. Uncovering the evolution of phenotypes (how they emerge, evolve, and contribute to health outcomes) is essential for defining more precise phenotypes and refining the understanding of disease progression. Our objective was to assess the benefits of an unsupervised approach to phenotype discovery that incorporates time and models diseases as dynamic processes.
In this study, we applied a constrained non-negative tensor-factorization approach to characterize the complexity of a cardiovascular disease (CVD) patient cohort based on longitudinal EHR data. Through tensor factorization, we identified a set of phenotypic topics (i.e., subphenotypes) that these patients developed over the 10 years prior to the diagnosis of CVD, and we characterized how each topic progressed over that period. For each identified subphenotype, we examined its association with the risk of adverse cardiovascular outcomes estimated by the American College of Cardiology/American Heart Association Pooled Cohort Risk Equations, a conventional CVD-risk assessment tool frequently used in clinical practice. Furthermore, we compared the subsequent myocardial infarction (MI) rates among the six most prevalent subphenotypes using survival analysis.
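To illustrate the general idea of non-negative tensor factorization, the following is a minimal sketch of a rank-R CP (PARAFAC) decomposition fitted with multiplicative updates. It is not the constrained method used in the study, whose details are not given in this abstract. The tensor layout (patients × PheCodes × time windows) and the names `nonneg_cp`, `khatri_rao`, `unfold`, and `relative_error` are assumptions made for illustration only.

```python
import numpy as np


def khatri_rao(B, C):
    """Column-wise Kronecker product of B (J x R) and C (K x R) -> (J*K x R)."""
    rank = B.shape[1]
    return (B[:, None, :] * C[None, :, :]).reshape(-1, rank)


def unfold(X, mode):
    """Mode-n unfolding: move `mode` to the front, flatten the rest in C order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)


def nonneg_cp(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Non-negative CP factorization of a 3-way tensor via multiplicative updates.

    Assumed layout (hypothetical): X[patient, phecode, time_window] >= 0.
    Returns three non-negative factor matrices, one per mode.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            # Remaining modes in their original order match the C-order unfolding.
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfold(X, mode) @ kr
            denom = factors[mode] @ (kr.T @ kr) + eps
            # Multiplicative update keeps every entry non-negative.
            factors[mode] *= numer / denom
    return factors


def relative_error(X, factors):
    """Relative Frobenius reconstruction error of the CP model."""
    A, B, C = factors
    X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Under this assumed layout, each column of the PheCode factor would play the role of a phenotypic topic, and the matching column of the time factor would trace how that topic's weight changes across time windows.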
From a cohort of 12,380 adults with CVD and 1068 unique PheCodes, we identified 14 subphenotypes. Through the association analysis with estimated CVD risk for each subphenotype, we found that some phenotypic topics, such as "Vitamin D deficiency and depression" and "Urinary infections", cannot be explained by the conventional risk factors. Through a survival analysis, we found markedly different risks of subsequent MI following the diagnosis of CVD among the six most prevalent topics (p < 0.0001), indicating that these topics may capture clinically meaningful subphenotypes of CVD.
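The abstract does not state which survival method was used. As one common way to compare event rates between groups, the sketch below implements a Kaplan-Meier estimator that could be run separately on each subphenotype group. The function name `kaplan_meier` and the input encoding (follow-up time plus an event flag, with 0 meaning censored) are assumptions for illustration.

```python
import numpy as np


def kaplan_meier(durations, events):
    """Kaplan-Meier survival estimate.

    durations: follow-up time for each patient (e.g., time from CVD diagnosis).
    events: 1 if the event (e.g., MI) was observed, 0 if the patient was censored.
    Returns the distinct event times and the survival probability after each.
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(durations[events])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)              # still under observation at t
        n_events = np.sum((durations == t) & events)  # events occurring exactly at t
        s *= 1.0 - n_events / at_risk
        survival.append(s)
    return event_times, np.array(survival)
```

Curves estimated this way for each group could then be compared with a log-rank test, which is a typical source of a between-group p-value in this kind of analysis.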
This study demonstrates the potential benefits of using tensor decomposition to model diseases as dynamic processes from longitudinal EHR data. Our results suggest that this data-driven approach may help researchers identify subphenotypes of complex and chronic diseases in precision medicine research.