Integration of genetic and clinical information to improve imputation of data missing from electronic health records.

Affiliations

Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

Publication information

J Am Med Inform Assoc. 2019 Oct 1;26(10):1056-1063. doi: 10.1093/jamia/ocz041.

DOI:10.1093/jamia/ocz041

PMID:31329892

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6748821/

Abstract

OBJECTIVE

Clinical data on patients' measurements and treatment histories stored in electronic health record (EHR) systems are starting to be mined to identify better treatment options and disease associations. A primary challenge in using EHR data is the considerable amount of missing data, and failure to address it can introduce significant bias into EHR-based research. Currently, imputation methods rely on correlations among the structured phenotype variables in the EHR. However, genetic studies have shown that many EHR-based phenotypes have a heritable component, suggesting that measured genetic variants might be useful for imputing missing data. In this article, we developed a computational model that incorporates patients' genetic information to perform EHR data imputation.

MATERIALS AND METHODS

We used the associations between individual single-nucleotide polymorphisms and phenotype variables in the EHR as input to construct a genetic risk score that quantifies the genetic contribution to each phenotype. Multiple approaches to constructing the genetic risk score were evaluated to identify the best-performing one. The genetic score, along with phenotype correlation, was then used as a predictor to impute the missing values.
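As a rough illustration of the approach described above, the following Python sketch builds a genetic risk score as an effect-size-weighted sum of allele dosages, then uses it together with a correlated phenotype in a linear regression to fill in missing values. It is not the authors' implementation: the weighted-sum score, the least-squares model, the function names `genetic_risk_score` and `impute_with_grs`, and the toy numbers are all assumptions made for illustration.

```python
import numpy as np


def genetic_risk_score(dosages, effect_sizes):
    """Weighted sum of allele dosages (0, 1, or 2 copies) for each patient.

    Assumption: effect sizes come from per-SNP association estimates.
    """
    dosages = np.asarray(dosages, dtype=float)
    effect_sizes = np.asarray(effect_sizes, dtype=float)
    return dosages @ effect_sizes


def impute_with_grs(target, covariates, grs):
    """Fill NaN entries of `target` using least squares on covariates plus the score."""
    target = np.asarray(target, dtype=float)
    # Design matrix: intercept, correlated phenotype(s), genetic risk score.
    X = np.column_stack([np.ones(len(target)), covariates, grs])
    observed = ~np.isnan(target)
    # Fit only on patients whose target value was recorded.
    coef, *_ = np.linalg.lstsq(X[observed], target[observed], rcond=None)
    filled = target.copy()
    filled[~observed] = X[~observed] @ coef
    return filled


# Toy data (hypothetical): 6 patients, 3 SNPs, 1 correlated phenotype.
dosages = np.array([[0, 1, 2], [1, 1, 0], [2, 0, 1],
                    [0, 0, 0], [1, 2, 2], [2, 2, 1]])
effects = np.array([0.5, -0.2, 0.1])
covariate = np.array([1.0, 2.0, 0.5, 3.0, 1.5, 2.5])

grs = genetic_risk_score(dosages, effects)
# Noise-free target so the recovered values are easy to check by hand.
complete = 1.0 + 2.0 * covariate + 3.0 * grs
with_gaps = complete.copy()
with_gaps[[1, 4]] = np.nan  # pretend two measurements are missing

filled = impute_with_grs(with_gaps, covariate, grs)
print(np.round(filled[[1, 4]], 3))  # expect values close to 5.9 and 4.9
```

In this toy setting the target is an exact linear function of the predictors, so the regression recovers the hidden values. With real EHR data the fit would be noisy, and a binary phenotype would call for a classifier such as logistic regression instead of least squares.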

RESULTS

To demonstrate the method's performance, we applied our model to impute missing cardiovascular-related measurements, including low-density lipoprotein, heart failure, and aortic aneurysm disease, in the Electronic Medical Records and Genomics (eMERGE) data. The integration method improved the area under the curve of imputation for binary phenotypes and decreased the root-mean-square error for continuous phenotypes.
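For readers unfamiliar with the two metrics named above, the sketch below shows one way to compute them in Python. It assumes an evaluation in which known values are hidden, imputed, and then compared with the truth; that setup, the function names `rmse` and `auc`, and the example numbers are illustrative assumptions and are not taken from the study.

```python
import numpy as np


def rmse(true_values, imputed_values):
    """Root-mean-square error for a continuous phenotype (lower is better)."""
    t = np.asarray(true_values, dtype=float)
    p = np.asarray(imputed_values, dtype=float)
    return float(np.sqrt(np.mean((t - p) ** 2)))


def auc(labels, scores):
    """Area under the ROC curve for a binary phenotype (higher is better).

    Computed as the fraction of case/control pairs in which the case
    receives the higher score, counting ties as one half.
    """
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    cases = scores[labels == 1]
    controls = scores[labels == 0]
    diff = cases[:, None] - controls[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Hypothetical held-out values and their imputations.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))       # sqrt(4/3), about 1.155
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))      # 3 of 4 pairs ranked correctly: 0.75
```

The pairwise AUC is quadratic in the number of patients, which is fine for a small check like this. A rank-based formula or a library routine would be a better fit for large cohorts.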

CONCLUSION

Compared with standard imputation approaches, incorporating genetic information offers a novel approach that can utilize more of the EHR data for better performance in missing data imputation.
