Department of Computer Science, Tennessee State University, Nashville, TN 37209, United States.
J Biomed Inform. 2023 Aug;144:104440. doi: 10.1016/j.jbi.2023.104440. Epub 2023 Jul 8.
The imputation of missing values in multivariate time series (MTS) data is critical to ensuring data quality and producing reliable data-driven predictive models. Alongside many statistical approaches, a few recent studies have proposed state-of-the-art deep learning methods to impute missing values in MTS data. However, the evaluation of these deep methods has been limited to one or two data sets, low missing rates, and missing values that occur completely at random. This survey performs six data-centric experiments to benchmark state-of-the-art deep imputation methods on five time series health data sets. Our extensive analysis reveals that no single imputation method outperforms the others on all five data sets. Imputation performance depends on data types, individual variable statistics, missing value rates, and missing value types. Deep learning methods that jointly perform cross-sectional (across variables) and longitudinal (across time) imputation of missing values in time series data yield statistically better data quality than traditional imputation methods. Although computationally expensive, deep learning methods are practical given the current availability of high-performance computing resources, especially when data quality and sample size are of paramount importance in healthcare informatics. Our findings highlight the importance of data-centric selection of imputation methods to optimize data-driven predictive models.
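To make the idea of benchmarking imputation at different missing rates concrete, the following is a minimal sketch, not the survey's actual protocol, methods, or data. It assumes a synthetic three-variable series, masks known values completely at random, and scores two simple traditional baselines (per-variable mean and linear interpolation along time) by RMSE on the masked entries only. All function names are hypothetical and chosen for illustration.

```python
import numpy as np


def mask_completely_at_random(x, rate, rng):
    """Return a copy of x with roughly `rate` of its entries set to NaN, plus the mask."""
    mask = rng.random(x.shape) < rate
    x_missing = x.copy()
    x_missing[mask] = np.nan
    return x_missing, mask


def impute_variable_mean(x_missing):
    """Baseline: fill each variable (column) with its observed mean, ignoring time order."""
    col_means = np.nanmean(x_missing, axis=0)
    return np.where(np.isnan(x_missing), col_means, x_missing)


def impute_linear_in_time(x_missing):
    """Longitudinal baseline: linear interpolation along time, one variable at a time."""
    t = np.arange(x_missing.shape[0])
    out = x_missing.copy()
    for j in range(out.shape[1]):
        observed = ~np.isnan(out[:, j])
        # np.interp holds the nearest observed value constant beyond the ends.
        out[:, j] = np.interp(t, t[observed], out[observed, j])
    return out


def rmse_on_masked(x_true, x_imputed, mask):
    """Score only the entries that were artificially removed."""
    return float(np.sqrt(np.mean((x_true[mask] - x_imputed[mask]) ** 2)))


# Synthetic stand-in for a multivariate time series (an assumption, not health data).
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 200)
x_true = np.column_stack([np.sin(t), np.cos(t), 0.5 * t])

results = {}
for rate in (0.1, 0.5):
    x_missing, mask = mask_completely_at_random(x_true, rate, rng)
    results[rate] = (
        rmse_on_masked(x_true, impute_variable_mean(x_missing), mask),
        rmse_on_masked(x_true, impute_linear_in_time(x_missing), mask),
    )
    print(f"missing rate {rate:.0%}: mean RMSE={results[rate][0]:.3f}, "
          f"linear RMSE={results[rate][1]:.3f}")
```

Neither baseline here uses information across variables; a method that imputes jointly across variables and across time, as the abstract describes for the deep learning approaches, would replace the two `impute_*` functions while the masking and scoring steps stay the same.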