Department of Industrial, Manufacturing and Systems Engineering, Texas Tech University, Lubbock, TX, United States of America.
PLoS One. 2020 Sep 21;15(9):e0237724. doi: 10.1371/journal.pone.0237724. eCollection 2020.
The wide adoption of electronic health record (EHR) systems has provided vast opportunities to advance health care services. However, the prevalence of missing values in EHR systems poses a great challenge to the data analysis that supports clinical decision-making. The objective of this study is to develop a new methodological framework that addresses the missing data challenge and provides a reliable tool to predict hospital readmission among heart failure patients.
We used a Gaussian Process Latent Variable Model (GPLVM) to impute the missing values. Specifically, a lower-dimensional embedding was learned from a small complete dataset and then used to impute the missing values in the incomplete dataset. The GPLVM-based imputation provides both a mean estimate and the uncertainty associated with that estimate. To incorporate this uncertainty in prediction, a constrained support vector machine (cSVM) was developed to obtain robust predictions. We first sampled multiple datasets from the distributions of input uncertainty and trained a support vector machine on each dataset. An optimal classifier was then identified by selecting the support vectors that maximize the separation margin of a newly sampled dataset and minimize the similarity with the pre-trained support vectors.
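The first stage of this procedure (sampling several datasets from the imputation uncertainty and training one SVM per sample) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' cSVM: the imputation uncertainty is assumed to be an independent Gaussian per entry, the SVM is a plain linear one fitted by subgradient descent, and the paper's support-vector selection step is replaced by simply averaging the decision scores. All names (`sample_datasets`, `train_linear_svm`, the toy data) are hypothetical.

```python
import numpy as np

def sample_datasets(mean, std, n_draws, rng):
    """Draw n_draws datasets, assuming each entry ~ N(mean, std^2) independently."""
    return [mean + std * rng.standard_normal(mean.shape) for _ in range(n_draws)]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by full-batch subgradient descent on the
    L2-regularized hinge loss. Labels y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1.0          # points violating the margin
        grad_w = lam * w - X[active].T @ y[active] / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data (assumption): two well-separated classes in 2-D.
rng = np.random.default_rng(0)
n_per_class = 100
X_mean = np.vstack([rng.normal(-2.0, 1.0, (n_per_class, 2)),
                    rng.normal(2.0, 1.0, (n_per_class, 2))])
y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])

# Pretend about 20% of entries were imputed and carry uncertainty (std 0.5);
# observed entries have zero uncertainty.
imputed = rng.random(X_mean.shape) < 0.2
X_std = np.where(imputed, 0.5, 0.0)

# One SVM per sampled dataset.
models = [train_linear_svm(X, y) for X in sample_datasets(X_mean, X_std, 10, rng)]

# Stand-in for the cSVM selection step: average the decision scores.
scores = np.mean([X_mean @ w + b for w, b in models], axis=0)
accuracy = np.mean(np.sign(scores) == y)
```

Averaging is used here only because it is the simplest way to combine the per-sample classifiers; the margin-and-similarity criterion described above would replace that last step.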
The proposed model was derived and validated using the PhysioNet MIMIC-III clinical database. The GPLVM imputation yielded normalized mean absolute errors of 0.11 and 0.12 when 20% and 30% of instances contained missing values, respectively, and the confidence bounds of the estimates captured 97% of the true values. The cSVM model achieved an average area under the curve (AUC) of 0.68, which improves the prediction accuracy by 7% compared to some existing classifiers.
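For readers unfamiliar with the two imputation metrics reported here, the following minimal sketch shows one common way to compute them. The abstract does not state how the error was normalized or which confidence level was used, so normalization by the range of the true values and 95% Gaussian bounds (mean ± 1.96 standard deviations) are assumptions; the function names and toy numbers are hypothetical.

```python
import numpy as np

def normalized_mae(y_true, y_pred):
    """Mean absolute error divided by the range of the true values (assumed normalization)."""
    value_range = y_true.max() - y_true.min()
    return np.mean(np.abs(y_true - y_pred)) / value_range

def interval_coverage(y_true, mean, std, z=1.96):
    """Fraction of true values falling inside mean +/- z * std."""
    lower, upper = mean - z * std, mean + z * std
    return np.mean((y_true >= lower) & (y_true <= upper))

# Toy values (assumption), not data from the study.
y_true = np.array([0.0, 5.0, 10.0])
y_pred = np.array([1.0, 5.0, 8.0])
y_std = np.array([1.0, 1.0, 1.0])

nmae = normalized_mae(y_true, y_pred)               # (1 + 0 + 2) / 3 / 10
coverage = interval_coverage(y_true, y_pred, y_std)  # 2 of 3 values covered
```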
The proposed method provides accurate imputation of missing values and achieves better prediction performance than existing models that can only handle deterministic inputs.