Deep imputation of missing values in time series health data: A review with benchmarking.

Affiliation

Department of Computer Science, Tennessee State University, Nashville, TN 37209, United States.

Publication information

J Biomed Inform. 2023 Aug;144:104440. doi: 10.1016/j.jbi.2023.104440. Epub 2023 Jul 8.

DOI:10.1016/j.jbi.2023.104440

PMID:37429511

Full text link:https://pmc.ncbi.nlm.nih.gov/articles/PMC10529422/

Abstract

The imputation of missing values in multivariate time series (MTS) data is critical in ensuring data quality and producing reliable data-driven predictive models. Apart from many statistical approaches, a few recent studies have proposed state-of-the-art deep learning methods to impute missing values in MTS data. However, the evaluation of these deep methods is limited to one or two data sets, low missing rates, and completely random missing value types. This survey performs six data-centric experiments to benchmark state-of-the-art deep imputation methods on five time series health data sets. Our extensive analysis reveals that no single imputation method outperforms the others on all five data sets. The imputation performance depends on data types, individual variable statistics, missing value rates, and types. Deep learning methods that jointly perform cross-sectional (across variables) and longitudinal (across time) imputations of missing values in time series data yield statistically better data quality than traditional imputation methods. Although computationally expensive, deep learning methods are practical given the current availability of high-performance computing resources, especially when data quality and sample size are of paramount importance in healthcare informatics. Our findings highlight the importance of data-centric selection of imputation methods to optimize data-driven predictive models.
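The distinction the abstract draws between longitudinal (across time) and cross-sectional (across variables) imputation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the synthetic three-variable series, the 30% missing rate, the function names, and the two deliberately simple imputers (linear interpolation over time, and a row-wise mean of standardized values). None of these is one of the deep methods the survey benchmarks. Values are hidden completely at random and the error is measured only on the hidden entries.

```python
import numpy as np


def mask_completely_at_random(x, rate, rng):
    """Hide a fraction `rate` of entries uniformly at random (hypothetical helper)."""
    mask = rng.random(x.shape) < rate  # True where a value is hidden
    x_missing = x.copy()
    x_missing[mask] = np.nan
    return x_missing, mask


def impute_longitudinal(x_missing):
    """Fill each variable (column) by linear interpolation across time."""
    filled = x_missing.copy()
    t = np.arange(filled.shape[0])
    for j in range(filled.shape[1]):
        col = filled[:, j]
        observed = ~np.isnan(col)
        if observed.any():
            # np.interp holds the first/last observed value at the edges.
            filled[~observed, j] = np.interp(t[~observed], t[observed], col[observed])
    return filled


def impute_cross_sectional(x_missing):
    """Fill each gap from the other variables observed at the same time step."""
    mu = np.nanmean(x_missing, axis=0)
    sd = np.nanstd(x_missing, axis=0)
    sd[sd == 0] = 1.0
    z = (x_missing - mu) / sd  # put variables on a common scale
    observed = ~np.isnan(z)
    counts = observed.sum(axis=1)
    sums = np.where(observed, z, 0.0).sum(axis=1)
    # Mean z-score of the observed variables in each row; 0 (column mean)
    # when every variable is missing at that time step.
    row_z = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
    z_filled = np.where(np.isnan(z), row_z[:, None], z)
    return z_filled * sd + mu


def rmse_on_masked(x_true, x_imputed, mask):
    """Root-mean-square error computed only on the hidden entries."""
    diff = (x_true - x_imputed)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic multivariate series: 200 time steps, 3 variables (assumed data).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0 * np.pi, 200)
x = np.column_stack([np.sin(t), 2.0 * np.sin(t) + 1.0, np.cos(t)])
x = x + 0.05 * rng.standard_normal(x.shape)

x_missing, mask = mask_completely_at_random(x, rate=0.3, rng=rng)
filled_time = impute_longitudinal(x_missing)
filled_vars = impute_cross_sectional(x_missing)

print("longitudinal RMSE:   ", rmse_on_masked(x, filled_time, mask))
print("cross-sectional RMSE:", rmse_on_masked(x, filled_vars, mask))
```

Changing `rate`, or replacing the random mask with one that hides contiguous blocks, is a quick way to see how the ranking of even these two simple imputers can shift with the missing rate and the missingness pattern, which is the kind of data dependence the abstract reports.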
