A comparison of various imputation algorithms for missing data.

Authors

Kampf Jürgen, Dykun Iryna, Rassaf Tienush, Mahabadi Amir Abbas

Affiliation

Department of Cardiology and Vascular Medicine, University Hospital of Essen, Essen, Germany.

Publication information

PLoS One. 2025 May 12;20(5):e0319784. doi: 10.1371/journal.pone.0319784. eCollection 2025.

Abstract

BACKGROUND

Many datasets in medicine and other branches of science are incomplete. In this article we compare various imputation algorithms for missing data.

OBJECTIVES

We take it as given that the imputation is carried out using multiple imputation by chained equations, so that the only remaining decision is the choice of subroutine for the one-dimensional imputations. The subroutines compared are predictive mean matching, weighted predictive mean matching, sampling, classification and regression trees, and random forests.
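
To make the idea of a one-dimensional imputation subroutine concrete, here is a minimal sketch of predictive mean matching in Python. It is not the implementation used in the study: the function name `pmm_impute`, the donor count `k` and the plain least-squares fit are assumptions made for illustration, and the sketch omits the random draw of regression coefficients that full implementations typically perform before matching.

```python
import numpy as np


def pmm_impute(x, y, k=5, rng=None):
    """Fill NaN entries of y by a simplified predictive mean matching.

    Hypothetical helper for illustration only. x holds fully observed
    predictors; y is the variable with missing values (NaN). At least
    one value of y must be observed.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    if not missing.any():
        return y.copy()

    # Linear model with intercept, fitted on the rows where y is observed.
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design[~missing], y[~missing], rcond=None)
    predicted = design @ beta

    observed_values = y[~missing]
    observed_pred = predicted[~missing]
    filled = y.copy()
    for i in np.flatnonzero(missing):
        # Donors: the k observed cases whose predicted mean is closest.
        distance = np.abs(observed_pred - predicted[i])
        donors = np.argsort(distance)[:k]
        # Impute the actually observed value of one randomly chosen donor.
        filled[i] = observed_values[rng.choice(donors)]
    return filled
```

Because every imputed value is copied from an observed case, the imputations always lie within the range of the observed data, which is one commonly cited reason for preferring this subroutine for skewed or bounded variables.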

METHODS

We compare these subroutines on real data and on simulated data. We consider the estimation of expected values, variances and coefficients of linear regression models, logistic regression models and Cox regression models. As real data we use survival times after the diagnosis of obstructive coronary artery disease, with systolic blood pressure, LDL, diabetes, smoking behavior and family history of premature heart disease as the variables whose values have to be imputed. Although we are mainly interested in statistical properties such as bias, mean squared error and the coverage probability of confidence intervals, we also keep an eye on computation time.
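
The statistical criteria named above can be computed from repeated simulation runs along the following lines. This is a sketch under stated assumptions rather than the study's evaluation code: the function name `summarize_estimates` is hypothetical, and the intervals use a normal approximation with `z = 1.96`, whereas analyses of multiply imputed data usually rely on a t reference distribution.

```python
import numpy as np


def summarize_estimates(estimates, std_errors, true_value, z=1.96):
    """Bias, mean squared error and interval coverage over simulation runs.

    Hypothetical helper for illustration. Each entry of estimates and
    std_errors comes from one simulated data set; true_value is the
    known parameter used to generate the data.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    errors = estimates - true_value

    # Normal-approximation confidence interval for each run (assumption).
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    covered = (lower <= true_value) & (true_value <= upper)

    return {
        "bias": float(errors.mean()),
        "mse": float(np.mean(errors ** 2)),
        "coverage": float(covered.mean()),
    }
```

For a nominal 95% interval, a coverage value well below 0.95 indicates that the imputation subroutine understates uncertainty or biases the estimate.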

RESULTS

Weighted predictive mean matching had to be excluded from the statistical comparison because of its enormous computation time. Among the remaining algorithms, predictive mean matching performed best in most of the situations we tested.

NOVELTY

This is by far the largest comparison study of subroutines for multiple imputation by chained equations performed to date.
