Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.
J Am Med Inform Assoc. 2019 Oct 1;26(10):1056-1063. doi: 10.1093/jamia/ocz041.
Clinical data on patients' measurements and treatment histories stored in electronic health record (EHR) systems are starting to be mined for better treatment options and disease associations. A primary challenge in using EHR data is the considerable amount of missing data. Failure to address this issue can introduce significant bias into EHR-based research. Current imputation methods rely on correlations among the structured phenotype variables in the EHR. However, genetic studies have shown that many EHR-based phenotypes have a heritable component, suggesting that measured genetic variants might be useful for imputing missing data. In this article, we developed a computational model that incorporates patients' genetic information to perform EHR data imputation.
We used the associations between individual single nucleotide polymorphisms (SNPs) and phenotype variables in the EHR as input to construct a genetic risk score that quantifies the genetic contribution to each phenotype. Multiple approaches to constructing the genetic risk score were evaluated for optimal performance. The genetic risk score, along with the correlations among phenotypes, is then used as a predictor to impute the missing values.
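The idea of combining a genetic risk score with a correlated phenotype can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the score is a weighted sum of allele dosages (one common construction among the several the abstract says were compared) and that the missing continuous value is predicted by ordinary least squares. All function names, weights, and data values are hypothetical.

```python
from typing import List, Optional, Sequence


def genetic_risk_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of allele dosages (0, 1, or 2) using per-SNP effect sizes."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows: Sequence[Sequence[float]], targets: Sequence[float]) -> List[float]:
    """Least-squares fit with an intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def impute_with_grs(
    grs: Sequence[float],
    covariate: Sequence[float],
    phenotype: Sequence[Optional[float]],
) -> List[float]:
    """Fill None entries using a linear model on the GRS and one correlated phenotype."""
    observed = [i for i, y in enumerate(phenotype) if y is not None]
    coef = fit_ols(
        [(grs[i], covariate[i]) for i in observed],
        [phenotype[i] for i in observed],
    )
    return [
        y if y is not None else coef[0] + coef[1] * g + coef[2] * c
        for g, c, y in zip(grs, covariate, phenotype)
    ]


# Hypothetical toy data: 5 patients, 3 SNPs, one correlated lab value.
snp_weights = [0.2, -0.1, 0.3]  # assumed per-SNP effect sizes
genotypes = [[0, 1, 2], [1, 0, 0], [2, 2, 1], [0, 0, 1], [1, 2, 2]]
scores = [genetic_risk_score(g, snp_weights) for g in genotypes]
total_chol = [150.0, 180.0, 200.0, 170.0, 160.0]
ldl = [180.0, 192.0, 205.0, 188.0, None]  # last patient's value is missing
ldl_imputed = impute_with_grs(scores, total_chol, ldl)
```

A binary phenotype such as heart failure would need a classifier (for example, logistic regression) in place of the linear model; the sketch covers only the continuous case.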
To demonstrate the method's performance, we applied our model to impute missing cardiovascular-related phenotypes, including low-density lipoprotein, heart failure, and aortic aneurysm disease, in the Electronic Medical Records and Genomics (eMERGE) data. The integration method increased the area under the curve of imputation for binary phenotypes and decreased the root-mean-square error for continuous phenotypes.
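The two metrics named above can be computed as in the following sketch. It assumes that imputed values are compared against known values that were held out, which the abstract does not spell out; the function names and example numbers are hypothetical.

```python
import math
from typing import Sequence


def rmse(true_values: Sequence[float], imputed_values: Sequence[float]) -> float:
    """Root-mean-square error between held-out true values and imputed values."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical held-out values for a continuous and a binary phenotype.
ldl_error = rmse([100.0, 130.0, 160.0], [100.0, 130.0, 166.0])
hf_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A lower RMSE and a higher AUC both indicate better imputation, which is the direction of improvement the abstract reports for the integration method.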
Compared with standard imputation approaches, incorporating genetic information is a novel strategy that makes use of more of the EHR data and achieves better performance in missing data imputation.