Identify the most appropriate imputation method for handling missing values in clinical structured datasets: a systematic review.

Affiliation

Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran.

Publication information

BMC Med Res Methodol. 2024 Aug 28;24(1):188. doi: 10.1186/s12874-024-02310-6.

DOI:10.1186/s12874-024-02310-6

PMID:39198744

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11351057/

Abstract

BACKGROUND AND OBJECTIVES

Understanding the research dataset is crucial for obtaining reliable and valid outcomes. Health analysts need a deep understanding of the data they analyze, because it allows them to suggest practical solutions for handling missing data in a clinical data source. Accurate handling of missing values is critical for producing precise estimates and making informed decisions, especially in high-stakes areas such as clinical research. As data have grown more diverse and complex, researchers have developed a wide range of imputation techniques. To help analysts choose among them, we conducted a systematic review that organizes imputation techniques by the characteristics of tabular datasets, namely the mechanism, pattern, and ratio of missingness, in order to identify the most appropriate imputation methods in the healthcare field.

MATERIALS AND METHODS

We searched four databases (PubMed, Web of Science, Scopus, and IEEE Xplore) for articles published up to September 20, 2023, that discussed imputation methods for addressing missing values in structured clinical datasets. Our analysis of the selected articles focused on four key aspects: the mechanism, pattern, and ratio of missingness, and the imputation strategy used. By synthesizing insights from these perspectives, we constructed an evidence map to recommend suitable imputation methods for handling missing values in a tabular dataset.
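Two of the dataset characteristics named above, the ratio and the pattern of missingness, can be computed directly from a table. The following is a minimal sketch, not the review's own procedure: the records, column names, and helper functions are hypothetical and chosen only for illustration. The missingness mechanism generally cannot be read off the data this way and is not estimated here.

```python
from collections import Counter

# Hypothetical toy records (assumption); None marks a missing value.
records = [
    {"age": 63, "sbp": 140, "hba1c": None},
    {"age": 51, "sbp": None, "hba1c": None},
    {"age": None, "sbp": 128, "hba1c": 6.1},
    {"age": 70, "sbp": 135, "hba1c": 7.4},
]

def missing_ratio(rows):
    """Fraction of missing entries per column."""
    columns = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) / len(rows) for c in columns}

def missing_patterns(rows):
    """Count each distinct row-wise pattern (1 = missing, 0 = observed)."""
    columns = list(rows[0].keys())
    return Counter(tuple(int(r[c] is None) for c in columns) for r in rows)

ratios = missing_ratio(records)
patterns = missing_patterns(records)
print(ratios)    # per-column share of missing values
print(patterns)  # how often each missing/observed pattern occurs
```

A profile like this is only a starting point; which imputation method suits a given ratio or pattern is the question the evidence map is meant to address.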

RESULTS

Of 2955 articles identified, 58 were included in the analysis. The evidence map, built from the structure of the missing values and the types of imputation methods extracted from these studies, showed that 45% of the studies employed conventional statistical methods, 31% used machine learning and deep learning methods, and 24% applied hybrid imputation techniques.
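As a rough sanity check, the reported percentages can be converted into approximate study counts. The abstract gives only percentages, so the counts below are derived by rounding and are an assumption, not figures reported by the authors.

```python
total_included = 58  # from the abstract

# Shares reported in the abstract; the category labels are shortened here.
shares = {
    "conventional statistical": 0.45,
    "machine/deep learning": 0.31,
    "hybrid": 0.24,
}

# Approximate counts implied by the rounded percentages (derived, not reported).
approx_counts = {name: round(total_included * share) for name, share in shares.items()}
print(approx_counts)
print(sum(approx_counts.values()))
```

The rounded counts add up to the 58 included studies, which suggests the three categories were treated as mutually exclusive.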

CONCLUSION

Considering the structure and characteristics of missing values in a clinical dataset is essential for choosing the most appropriate imputation technique, especially among conventional statistical methods. Estimating missing values accurately, so that they reflect reality, increases the likelihood of obtaining high-quality, reusable data and contributes significantly to precise medical decision-making. This review provides a guideline for choosing the most appropriate imputation method during data preprocessing, before analysis of structured clinical datasets.
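To make "conventional statistical methods" concrete, here is a minimal sketch of one of the simplest, mean imputation, paired with a missingness flag so that later analysis can tell imputed values from observed ones. It is an illustration only: the function name and the sample values are hypothetical, and the abstract does not say that this particular method is the one recommended.

```python
from statistics import mean

def mean_impute(values):
    """Replace None with the mean of the observed values.

    Returns the imputed list and a parallel list of booleans
    marking which entries were originally missing.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing

# Hypothetical systolic blood pressure readings (assumption).
sbp = [140, None, 128, 135]
imputed, flags = mean_impute(sbp)
print(imputed)
print(flags)
```

Mean imputation shrinks the variance of the imputed column and ignores relationships between variables, which is one reason the structure of the missing values matters when choosing a method.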

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/23ce/11351057/aacaf7fb8597/12874_2024_2310_Fig1_HTML.jpg
