Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran.
BMC Med Res Methodol. 2024 Aug 28;24(1):188. doi: 10.1186/s12874-024-02310-6.
Understanding the research dataset is essential for obtaining reliable and valid results. Health analysts need a thorough understanding of the data they analyze so that they can propose practical solutions for handling missing data in a clinical data source. Accurate handling of missing values is critical for producing precise estimates and making informed decisions, especially in high-stakes areas such as clinical research. As data have grown more diverse and complex, researchers have developed a wide range of imputation techniques. To help analysts choose among them, we conducted a systematic review that organizes imputation techniques by the characteristics of tabular datasets, namely the mechanism, pattern, and ratio of missingness, in order to identify the most appropriate imputation methods in the healthcare field.
We searched four databases (PubMed, Web of Science, Scopus, and IEEE Xplore) for articles published up to September 20, 2023, that discussed imputation methods for addressing missing values in structured clinical datasets. We examined the selected articles along four key aspects: the mechanism, pattern, and ratio of missingness, and the imputation strategy used. By synthesizing insights from these perspectives, we constructed an evidence map to recommend suitable imputation methods for handling missing values in a tabular dataset.
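Two of the dataset characteristics named above, the ratio and the pattern of missingness, can be computed directly from a table. The following is a minimal sketch and not part of the review's method. It assumes that missing cells are encoded as `None`; the column names (`age`, `sbp`, `hba1c`) and the helper functions are hypothetical. The mechanism of missingness (for example, whether values are missing at random) generally cannot be read off the table in this way and is not addressed here.

```python
from collections import Counter

# Assumption: a missing cell is encoded as None.
def is_missing(value):
    return value is None

def missing_ratio(rows, columns):
    """Fraction of missing cells in each column."""
    n = len(rows)
    return {c: sum(1 for r in rows if is_missing(r[c])) / n for c in columns}

def missing_patterns(rows, columns):
    """Count each row-level missingness pattern (True = missing)."""
    return Counter(tuple(is_missing(r[c]) for c in columns) for r in rows)

def is_monotone(rows, columns):
    """True if, in the given column order, every value after the first
    missing value in a row is also missing (a monotone pattern)."""
    for r in rows:
        seen_missing = False
        for c in columns:
            if is_missing(r[c]):
                seen_missing = True
            elif seen_missing:
                return False
    return True

# Hypothetical toy records, for illustration only.
columns = ["age", "sbp", "hba1c"]
rows = [
    {"age": 54, "sbp": 130, "hba1c": 6.1},
    {"age": 61, "sbp": None, "hba1c": None},
    {"age": 47, "sbp": 118, "hba1c": None},
    {"age": 70, "sbp": None, "hba1c": None},
]

ratios = missing_ratio(rows, columns)
patterns = missing_patterns(rows, columns)
monotone = is_monotone(rows, columns)
```

On this toy table, `hba1c` has the highest missing ratio and the pattern is monotone in the listed column order. A profile of this kind is the sort of input the review suggests considering before an imputation method is chosen.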
Of 2955 articles identified, 58 were included in the analysis. The evidence map, built from the structure of the missing values and the types of imputation methods extracted from these studies, showed that 45% of the studies employed conventional statistical methods, 31% used machine learning and deep learning methods, and 24% applied hybrid imputation techniques for handling missing values.
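To make the first two categories concrete, the sketch below contrasts one conventional statistical method (mean imputation) with one simple machine-learning-style method (k-nearest-neighbour imputation). These two methods are chosen here only for illustration; the abstract does not state which specific methods the included studies used. The function names, the toy data, and the assumptions that missing cells are `None` and that the predictor columns are fully observed are all hypothetical. No feature scaling is applied.

```python
import math

def mean_impute(rows, col):
    """Replace missing values in `col` with the mean of the observed values."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [dict(r, **{col: mean}) if r[col] is None else dict(r) for r in rows]

def knn_impute(rows, col, features, k=2):
    """Replace missing values in `col` with the average of `col` over the
    k nearest complete rows, using Euclidean distance on `features`.
    Assumes every row has observed values for all `features`."""
    donors = [r for r in rows if r[col] is not None]
    imputed = []
    for r in rows:
        if r[col] is not None:
            imputed.append(dict(r))
            continue
        point = [r[f] for f in features]
        nearest = sorted(
            donors,
            key=lambda d: math.dist(point, [d[f] for f in features]),
        )[:k]
        estimate = sum(d[col] for d in nearest) / len(nearest)
        imputed.append(dict(r, **{col: estimate}))
    return imputed

# Hypothetical toy records: systolic blood pressure is missing in the last row.
rows = [
    {"age": 40, "sbp": 110},
    {"age": 45, "sbp": 120},
    {"age": 70, "sbp": 150},
    {"age": 72, "sbp": None},
]

mean_rows = mean_impute(rows, "sbp")
knn_rows = knn_impute(rows, "sbp", features=["age"], k=2)
```

Mean imputation fills the gap with the average of all observed values regardless of the patient's other attributes, whereas the nearest-neighbour version borrows only from the records closest in age. Which behaviour is preferable depends on the mechanism, pattern, and ratio of missingness discussed in the review.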
Considering the structure and characteristics of missing values in a clinical dataset is essential for choosing the most appropriate data imputation technique, especially among conventional statistical methods. Estimating missing values so that they closely reflect reality increases the likelihood of obtaining high-quality, reusable data and contributes significantly to precise medical decision-making. This review provides a guideline for choosing the most appropriate imputation methods during data preprocessing, before analysis of structured clinical datasets.