Rahman Shah Atiqur, Huang Yuxiao, Claassen Jan, Heintzman Nathaniel, Kleinberg Samantha
Department of Computer Science, Stevens Institute of Technology, NJ, United States.
Division of Critical Care Neurology, Department of Neurology, Columbia University, College of Physicians and Surgeons, New York, NY, United States.
J Biomed Inform. 2015 Dec;58:198-207. doi: 10.1016/j.jbi.2015.10.004. Epub 2015 Oct 21.
Most clinical and biomedical data contain missing values. A patient's record may be split across multiple institutions, devices may fail, and sensors may not be worn at all times. While these missing values are often ignored, doing so can lead to bias and error when the data are mined. Further, the data are not simply missing at random. Instead, the measurement of a variable such as blood glucose may depend on its prior values as well as those of other variables. These dependencies exist across time as well, but current methods have yet to incorporate both these temporal relationships and multiple types of missingness. To address this, we propose an imputation method (FLk-NN) that incorporates time-lagged correlations both within and across variables by combining two imputation methods, one based on an extension to k-NN and the other on the Fourier transform. This enables imputation of missing values even when all data at a time point are missing and when there are different types of missingness both within and across variables. In comparison to other approaches on three biological datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring), the proposed method has the highest imputation accuracy. This held when up to half the data were missing and when consecutive missing values made up a significant fraction of the overall time series length.
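To make the idea of combining a lagged k-NN estimate with a Fourier-based estimate concrete, the following is a minimal sketch in Python using NumPy. It is not the FLk-NN algorithm as published: the function names, the single fixed predictor and lag, the low-pass cutoff `keep`, and the equal-weight average of the two estimates are all assumptions chosen for illustration.

```python
import numpy as np

def fourier_impute(x, keep=5):
    """Fill NaNs in a 1-D series with a low-pass Fourier reconstruction.
    (Illustrative stand-in; `keep` is an assumed cutoff.)"""
    x = np.asarray(x, dtype=float)
    missing = np.isnan(x)
    if not missing.any():
        return x.copy()
    idx = np.arange(len(x))
    # Provisional linear interpolation so the FFT sees no NaNs.
    filled = x.copy()
    filled[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    spectrum = np.fft.rfft(filled)
    spectrum[keep:] = 0  # keep only the lowest `keep` frequency components
    smooth = np.fft.irfft(spectrum, n=len(x))
    out = x.copy()
    out[missing] = smooth[missing]
    return out

def lagged_knn_impute(data, target, predictor, lag, k=3):
    """Fill NaNs in column `target` using column `predictor` observed
    `lag` steps earlier. Gaps with no lagged predictor value stay NaN."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    y = data[:, target]
    out = y.copy()
    # z[t] holds the predictor value at time t - lag.
    z = np.full(n, np.nan)
    z[lag:] = data[: n - lag, predictor]
    train = ~np.isnan(y) & ~np.isnan(z)
    if not train.any():
        return out
    for t in np.flatnonzero(np.isnan(y) & ~np.isnan(z)):
        dist = np.abs(z[train] - z[t])
        nearest = np.argsort(dist)[:k]
        out[t] = y[train][nearest].mean()
    return out

def combined_impute(data, target, predictor, lag, k=3, keep=5):
    """Average the two estimates (assumed equal weights); fall back to
    the Fourier estimate where the lagged predictor is also missing."""
    y = np.asarray(data, dtype=float)[:, target]
    knn = lagged_knn_impute(data, target, predictor, lag, k)
    fou = fourier_impute(y, keep)
    missing = np.isnan(y)
    out = y.copy()
    both = missing & ~np.isnan(knn)
    out[both] = 0.5 * (knn[both] + fou[both])
    only_fourier = missing & np.isnan(knn)
    out[only_fourier] = fou[only_fourier]
    return out

# Synthetic example: column 1 follows column 0 with a 3-step delay.
t = np.arange(200)
a = np.sin(2 * np.pi * t / 50)
b = np.roll(a, 3)
data = np.column_stack([a, b])
data[60:70, 1] = np.nan  # a run of consecutive missing values
estimate = combined_impute(data, target=1, predictor=0, lag=3)
print("max abs error in gap:", np.max(np.abs(estimate[60:70] - b[60:70])))
```

In this sketch the Fourier estimate depends only on the target series itself, so it can still produce a value at time points where every variable is missing, while the lagged k-NN estimate exploits the cross-variable dependency whenever the delayed predictor was observed.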