Detecting time-evolving phenotypic topics via tensor factorization on electronic health records: Cardiovascular disease case study.

Affiliations

Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA.

Fixed Income Division, Morgan Stanley & Co LLC, New York, NY, USA.

Publication information

J Biomed Inform. 2019 Oct;98:103270. doi: 10.1016/j.jbi.2019.103270. Epub 2019 Aug 22.

DOI:10.1016/j.jbi.2019.103270

PMID:31445983

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6783385/

Abstract

OBJECTIVE

Discovering subphenotypes of complex diseases can help characterize disease cohorts for investigative studies aimed at developing better diagnoses and treatments. Recent advances in unsupervised machine learning on electronic health record (EHR) data have enabled researchers to discover phenotypes without input from domain experts. However, most existing studies have ignored time and modeled diseases as discrete events. Uncovering the evolution of phenotypes - how they emerge, evolve and contribute to health outcomes - is essential to define more precise phenotypes and refine the understanding of disease progression. Our objective was to assess the benefits of an unsupervised approach that incorporates time to model diseases as dynamic processes in phenotype discovery.

METHODS

In this study, we applied a constrained non-negative tensor-factorization approach to characterize the complexity of a cardiovascular disease (CVD) patient cohort based on longitudinal EHR data. Through tensor factorization, we identified a set of phenotypic topics (i.e., subphenotypes) that these patients developed over the 10 years prior to the diagnosis of CVD, and showed how each topic progressed over that period. For each identified subphenotype, we examined its association with the risk for adverse cardiovascular outcomes estimated by the American College of Cardiology/American Heart Association Pooled Cohort Risk Equations, a conventional CVD-risk assessment tool frequently used in clinical practice. Furthermore, we compared the subsequent myocardial infarction (MI) rates among the six most prevalent subphenotypes using survival analysis.
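To make the factorization step concrete, the following is a minimal sketch of a non-negative CP (PARAFAC) decomposition of a three-way tensor, assuming the modes are patients, PheCodes, and time windows. It uses plain multiplicative updates and enforces only non-negativity; the abstract does not state the authors' additional constraints or solver, so this is an illustration of the general technique, not their implementation. All function names and the toy data are hypothetical.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product, shape (B.shape[0] * C.shape[0], R)."""
    R = B.shape[1]
    return (B[:, None, :] * C[None, :, :]).reshape(-1, R)

def unfold(X, mode):
    """Matricize X: rows index `mode`, other modes flattened in C order."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def reconstruct(factors):
    """Rebuild the tensor as a sum of rank-one (topic) components."""
    A, B, C = factors
    return np.einsum('ir,jr,kr->ijk', A, B, C)

def nonneg_cp(X, rank, n_iter=200, seed=0, eps=1e-9):
    """Non-negative CP via multiplicative updates (assumed solver, see lead-in)."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfold(X, mode) @ kr
            denom = factors[mode] @ (kr.T @ kr) + eps
            # Ratio of non-negative terms keeps every entry non-negative.
            factors[mode] *= numer / denom
    return factors

# Toy data (hypothetical): 6 patients x 5 PheCodes x 4 time windows, 2 topics.
rng = np.random.default_rng(1)
X_toy = reconstruct([rng.random((6, 2)), rng.random((5, 2)), rng.random((4, 2))])
patient_f, phecode_f, time_f = nonneg_cp(X_toy, rank=2)
```

Under this reading, each column of `phecode_f` would describe one phenotypic topic as a weighting over PheCodes, the matching column of `time_f` its trajectory across time windows, and the matching column of `patient_f` how strongly each patient expresses it.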

RESULTS

From a cohort of 12,380 adults with CVD and 1,068 unique PheCodes, we identified 14 subphenotypes. Through the association analysis with estimated CVD risk for each subtype, we found that some phenotypic topics, such as vitamin D deficiency and depression, and urinary infections, cannot be explained by the conventional risk factors. Through a survival analysis, we found markedly different risks of subsequent MI following the diagnosis of CVD among the six most prevalent topics (p < 0.0001), indicating these topics may capture clinically meaningful subphenotypes of CVD.
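The abstract does not say which survival estimator or test produced the reported p-value. As one common option, the sketch below computes a Kaplan-Meier survival curve for a single group; comparing groups would additionally need a test such as the log-rank test, which is not shown. The function name and the toy follow-up data are hypothetical.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Kaplan-Meier estimate of S(t) at each distinct event time.

    durations: follow-up time per patient (e.g., years after CVD diagnosis)
    events:    1 if the event (e.g., MI) was observed, 0 if censored
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)
    times = np.unique(durations[events])  # sorted distinct event times
    survival = []
    s = 1.0
    for t in times:
        at_risk = np.sum(durations >= t)            # still under observation at t
        observed = np.sum((durations == t) & events)  # events occurring exactly at t
        s *= 1.0 - observed / at_risk
        survival.append(s)
    return times, np.array(survival)

# Toy follow-up data for one subphenotype group (hypothetical values).
times, surv = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
```

Running the estimator separately on the patients assigned to each topic would give one curve per subphenotype, which is the kind of comparison the result above describes.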

CONCLUSION

This study demonstrates the potential benefits of using tensor decomposition to model diseases as dynamic processes from longitudinal EHR data. Our results suggest that this data-driven approach may help researchers identify subphenotypes of complex and chronic diseases in precision medicine research.
