How much missing data is too much to impute for longitudinal health indicators? A preliminary guideline for the choice of the extent of missing proportion to impute with multiple imputation by chained equations.

Authors

Junaid K P, Kiran Tanvi, Gupta Madhu, Kishore Kamal, Siwatch Sujata

Affiliations

Department of Community Medicine and School of Public Health, Postgraduate Institute of Medical Education and Research (PGIMER), Chandigarh, India.

Department of Biostatistics, Postgraduate Institute of Medical Education and Research, Chandigarh, India.

Publication information

Popul Health Metr. 2025 Feb 1;23(1):2. doi: 10.1186/s12963-025-00364-2.

Abstract

BACKGROUND

Multiple imputation by chained equations (MICE) is a widely used approach for handling missing data. However, its robustness, especially at high missing proportions in health indicators, is under-researched. The study aimed to provide a preliminary guideline on how large a missing proportion can be imputed in longitudinal health-related data using the MICE method.

METHODS

The study obtained complete data on five mortality-related health indicators for 100 countries (2015-2019) from the Global Health Observatory. Nine incomplete datasets with missing rates from 10% to 90% were generated and imputed using MICE. The robustness of MICE was assessed through three approaches: comparison of means using repeated-measures analysis of variance (ANOVA); estimation of evaluation metrics (root mean square error, mean absolute deviation, bias, and proportionate variance); and visual inspection of box plots of imputed and non-imputed data.
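The evaluation loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the data are synthetic, values are deleted completely at random, and the imputer is a simple mean fill standing in for MICE so that the example needs only the standard library. The metric definitions are also assumptions, since the abstract does not give formulas: "mean absolute deviation" is taken as the mean absolute difference between imputed and true values, and "proportionate variance" as the ratio of the imputed dataset's variance to the complete dataset's variance. All function names are hypothetical.

```python
import math
import random
import statistics


def make_missing(values, proportion, rng):
    """Return a copy with `proportion` of entries set to None, chosen at random."""
    n_missing = round(len(values) * proportion)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


def mean_impute(values):
    """Placeholder imputer (NOT MICE): fill every gap with the observed mean."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def evaluate(true, incomplete, imputed):
    """Compare imputed values with true values on the cells that were deleted."""
    idx = [i for i, v in enumerate(incomplete) if v is None]
    errors = [imputed[i] - true[i] for i in idx]
    return {
        "rmse": math.sqrt(statistics.fmean([e * e for e in errors])),
        "mad": statistics.fmean([abs(e) for e in errors]),
        "bias": statistics.fmean(errors),
        # Assumed definition: variance of the imputed data relative to the complete data.
        "variance_ratio": statistics.pvariance(imputed) / statistics.pvariance(true),
    }


rng = random.Random(0)
# Synthetic stand-in for one complete health indicator (hypothetical values).
true_values = [rng.gauss(50.0, 10.0) for _ in range(500)]

results = {}
for pct in range(10, 100, 10):  # nine datasets: 10%, 20%, ..., 90% missing
    incomplete = make_missing(true_values, pct / 100, rng)
    results[pct] = evaluate(true_values, incomplete, mean_impute(incomplete))

for pct, metrics in results.items():
    print(pct, {name: round(value, 3) for name, value in metrics.items()})
```

With a mean fill, the variance ratio falls roughly in step with the observed fraction, which is an extreme form of the variance shrinkage the study reports. A chained-equations imputer would be swapped in for `mean_impute` to reproduce the comparison more faithfully.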

RESULTS

The repeated-measures analysis of variance revealed significant differences between complete and imputed data, primarily in imputed data with missing proportions above 50%. Evaluation metrics exhibited 'high' performance for the dataset with a 50% missing proportion across various health indicators. However, with missing proportions exceeding 70%, the majority of indicators demonstrated a 'low' performance level on most evaluation metrics. Visual inspection of the box plots revealed severe variance shrinkage in imputed datasets with missing proportions beyond 70%, corroborating the findings from the evaluation metrics.

CONCLUSION

MICE demonstrates high robustness up to 50% missing values, with marginal deviations from the complete datasets. Caution is warranted for missing proportions between 50% and 70%, as moderate alterations are observed. Proportions beyond 70% lead to significant variance shrinkage and compromised data reliability, emphasizing the importance of acknowledging imputation limitations for practical decision-making.
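As a rough aid, the three bands in this conclusion can be written as a small lookup. The function name and the returned labels are hypothetical, and the handling of the exact boundaries (50% counted as robust, 70% counted as caution) is an assumption drawn from the wording "up to 50%" and "beyond 70%". The guideline is preliminary and was derived from five mortality-related indicators, so the output is a prompt for judgment rather than a rule.

```python
def mice_guidance(missing_pct: float) -> str:
    """Map a missing percentage (0-100) to the band suggested by the abstract."""
    if not 0 <= missing_pct <= 100:
        raise ValueError("missing_pct must be between 0 and 100")
    if missing_pct <= 50:
        return "high robustness"  # marginal deviation from complete data
    if missing_pct <= 70:
        return "caution"  # moderate alterations observed
    return "unreliable"  # significant variance shrinkage


for pct in (10, 50, 60, 70, 90):
    print(pct, mice_guidance(pct))
```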
