Department of Preventive Medicine, Northwestern University, Chicago, IL, USA.
Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA.
J Am Med Inform Assoc. 2018 Jun 1;25(6):645-653. doi: 10.1093/jamia/ocx133.
A key challenge in clinical data mining is that most clinical datasets contain missing data. Since many commonly used machine learning algorithms require complete datasets (no missing data), clinical analytic approaches often entail an imputation procedure to "fill in" missing data. However, although most clinical datasets contain a temporal component, most commonly used imputation methods do not adequately accommodate longitudinal time-based data. We sought to develop a new imputation algorithm, 3-dimensional multiple imputation with chained equations (3D-MICE), that can perform accurate imputation of missing clinical time series data.
We extracted clinical laboratory test results for 13 commonly measured analytes. We imputed missing results for these analytes using 3 imputation methods: multiple imputation with chained equations (MICE), Gaussian process (GP) imputation, and 3D-MICE, which uses both MICE and GP imputation to integrate cross-sectional and longitudinal information. To evaluate the performance of each method, we randomly masked selected test results and imputed them alongside the results missing from our original data. We then compared the predicted results to the measured results at the masked data points.
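The mask-and-compare evaluation described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`mask_observed`, `impute_by_neighbors`), the 25% masking fraction, and the sodium-like values are hypothetical, and the neighbor-averaging imputer is a deliberately simple stand-in, not MICE, GP, or 3D-MICE.

```python
import random


def mask_observed(series, fraction, rng):
    """Hide a random fraction of the observed (non-None) values.

    Returns the masked copy and a dict {index: measured value} for the
    hidden points, so predictions can later be compared to measurements.
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    if not observed:
        return list(series), {}
    n_mask = max(1, int(len(observed) * fraction))
    masked = list(series)
    truth = {}
    for i in rng.sample(observed, n_mask):
        truth[i] = masked[i]
        masked[i] = None
    return masked, truth


def impute_by_neighbors(series):
    """Hypothetical stand-in imputer (NOT the method in the abstract).

    Fills each gap with the mean of the nearest observed values before
    and after it in time; uses only longitudinal information.
    """
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        prev = next((series[j] for j in range(i - 1, -1, -1)
                     if series[j] is not None), None)
        nxt = next((series[j] for j in range(i + 1, len(series))
                    if series[j] is not None), None)
        neighbors = [x for x in (prev, nxt) if x is not None]
        filled[i] = sum(neighbors) / len(neighbors) if neighbors else None
    return filled


# Made-up values for one analyte in one patient; None marks a result
# that was missing in the original data.
rng = random.Random(0)
sodium = [140.0, None, 138.0, 139.0, None, 141.0, 142.0, 140.0]

masked, truth = mask_observed(sodium, 0.25, rng)
imputed = impute_by_neighbors(masked)

# (measured, predicted) pairs exist only for the masked points, because
# originally missing results have no measured value to compare against.
pairs = [(truth[i], imputed[i]) for i in truth]
```

Masked points and originally missing points are imputed in the same pass, but only the masked points contribute to the error estimate.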
3D-MICE performed significantly better than MICE and GP-based imputation in a composite of all 13 analytes, predicting missing results with a normalized root-mean-square error of 0.342, compared to 0.373 for MICE alone and 0.358 for GP alone.
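For readers unfamiliar with the metric, a normalized root-mean-square error can be computed as in the sketch below. The abstract does not state how the error was normalized, so dividing by the range of the measured values is an assumption made here for illustration (dividing by the standard deviation is another common convention). The function name `nrmse` and the example values are hypothetical.

```python
import math


def nrmse(measured, predicted):
    """Root-mean-square error divided by the range of the measured values.

    ASSUMPTION: range normalization is chosen for illustration only; the
    abstract does not specify the normalization that was used.
    """
    if not measured or len(measured) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    spread = max(measured) - min(measured)
    if spread == 0:
        raise ValueError("measured values have zero range")
    mse = sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)
    return math.sqrt(mse) / spread


# Made-up masked-point values: four exact predictions and one that is off by 2.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 2.0, 3.0, 4.0, 7.0]
score = nrmse(measured, predicted)  # sqrt(4 / 5) / 4
```

Lower values indicate predictions closer to the measured results, and normalization makes errors comparable across analytes reported on different scales.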
3D-MICE offers a novel and practical approach to imputing clinical laboratory time series data. 3D-MICE may provide an additional tool for use as a foundation in clinical predictive analytics and intelligent clinical decision support.