A comparison of various imputation algorithms for missing data.

Authors

Kampf Jürgen, Dykun Iryna, Rassaf Tienush, Mahabadi Amir Abbas

Affiliation

Department of Cardiology and Vascular Medicine, University Hospital of Essen, Essen, Germany.

Publication

PLoS One. 2025 May 12;20(5):e0319784. doi: 10.1371/journal.pone.0319784. eCollection 2025.

DOI:10.1371/journal.pone.0319784

PMID:40354495

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12068701/

Abstract

BACKGROUND

Many datasets in medicine and other branches of science are incomplete. In this article we compare various imputation algorithms for missing data.

OBJECTIVES

We assume that the decision to impute by multiple imputation by chained equations (MICE) has already been made, so that the only remaining choice is the subroutine used for the one-dimensional imputations. The subroutines compared are predictive mean matching, weighted predictive mean matching, sampling, classification or regression trees, and random forests.
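To make the notion of a one-dimensional imputation subroutine concrete, the following is a minimal sketch of predictive mean matching for a single incomplete variable with one predictor. It is an illustration under stated assumptions, not the implementation evaluated in the study: the function names `fit_line` and `pmm_impute`, the donor count `k` and the toy data are all hypothetical, and the sketch uses the point estimate of the regression line, omitting the parameter draw that a full multiple-imputation routine would perform.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(xs, ys, k=3, rng=None):
    """Fill the None entries of ys by predictive mean matching on xs.

    Simplification (assumption): the regression coefficients are not
    redrawn, so this shows only the matching step.
    """
    rng = rng or random.Random(0)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    intercept, slope = fit_line([x for x, _ in observed],
                                [y for _, y in observed])
    # Each donor is (predicted mean, actually observed value).
    donors = [(intercept + slope * x, y) for x, y in observed]
    completed = []
    for x, y in zip(xs, ys):
        if y is not None:
            completed.append(y)
            continue
        target = intercept + slope * x
        # Keep the k donors whose predicted mean is closest to the target.
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        # Impute an observed value borrowed from one randomly chosen donor.
        completed.append(rng.choice(nearest)[1])
    return completed


# Hypothetical toy data: two values of ys are missing.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.1, 1.9, None, 4.2, None, 6.1]
filled = pmm_impute(xs, ys, k=2)
```

Because every imputed value is copied from an observed case, the completed variable can only contain values that actually occur in the data. Within chained equations, a routine of this kind would be applied to each incomplete variable in turn and the cycle repeated.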

METHODS

We compare these subroutines on real data and on simulated data. We consider the estimation of expected values, variances and coefficients of linear regression models, logistic regression models and Cox regression models. The real data are survival times after the diagnosis of obstructive coronary artery disease; the variables with values to impute are systolic blood pressure, LDL, diabetes, smoking behavior and family history of premature heart disease. Although our main interest is in statistical properties such as bias, mean squared error and coverage probability of confidence intervals, we also consider computation time.
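The three statistical criteria named above can be computed from repeated simulation runs roughly as follows. This is a sketch only: the function name `summarize` and the numbers are invented for illustration and are not results from the study.

```python
def summarize(estimates, intervals, truth):
    """Return (bias, mean squared error, coverage) over simulation runs.

    estimates: one point estimate per run
    intervals: one (lower, upper) confidence interval per run
    truth:     the known true parameter value used to simulate the data
    """
    n = len(estimates)
    bias = sum(estimates) / n - truth
    mse = sum((e - truth) ** 2 for e in estimates) / n
    # Coverage: fraction of runs whose interval contains the true value.
    coverage = sum(lo <= truth <= hi for lo, hi in intervals) / n
    return bias, mse, coverage


# Made-up values for four hypothetical runs with true parameter 10.0.
estimates = [9.0, 10.0, 11.0, 12.0]
intervals = [(8.0, 10.0), (9.0, 11.0), (10.5, 11.5), (11.0, 13.0)]
bias, mse, coverage = summarize(estimates, intervals, truth=10.0)
# bias = 0.5, mse = 1.5, coverage = 0.5
```

For a nominal 95% confidence interval, a well-behaved imputation method would be expected to give coverage close to 0.95 over many runs; the toy value of 0.5 here only demonstrates the arithmetic.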

RESULTS

Weighted predictive mean matching had to be excluded from the statistical comparison due to its enormous computation time. Among the remaining algorithms, in most situations we tested, predictive mean matching performed best.

NOVELTY

This is by far the largest comparison study for subroutines of multiple imputation by chained equations that has been performed up to now.

Similar articles

The performance of prognostic models depended on the choice of missing value imputation algorithm: a simulation study.

J Clin Epidemiol. 2024 Dec;176:111539. doi: 10.1016/j.jclinepi.2024.111539. Epub 2024 Sep 24.

Logistic regression vs. predictive mean matching for imputing binary covariates.

Stat Methods Med Res. 2023 Nov;32(11):2172-2183. doi: 10.1177/09622802231198795. Epub 2023 Sep 26.

Imputation of missing values of tumour stage in population-based cancer registration.

BMC Med Res Methodol. 2011 Sep 19;11:129. doi: 10.1186/1471-2288-11-129.

Multiple imputation with sequential penalized regression.

Stat Methods Med Res. 2019 May;28(5):1311-1327. doi: 10.1177/0962280218755574. Epub 2018 Feb 16.

Assessment of predictive performance in incomplete data by combining internal validation and multiple imputation.

BMC Med Res Methodol. 2016 Oct 26;16(1):144. doi: 10.1186/s12874-016-0239-7.

Real-time imputation of missing predictor values improved the application of prediction models in daily practice.

J Clin Epidemiol. 2021 Jun;134:22-34. doi: 10.1016/j.jclinepi.2021.01.003. Epub 2021 Jan 19.

A nonparametric multiple imputation approach for missing categorical data.

BMC Med Res Methodol. 2017 Jun 6;17(1):87. doi: 10.1186/s12874-017-0360-2.

Cox regression analysis with missing covariates via nonparametric multiple imputation.

Stat Methods Med Res. 2019 Jun;28(6):1676-1688. doi: 10.1177/0962280218772592. Epub 2018 May 2.

Classification of breast cancer recurrence based on imputed data: a simulation study.

BioData Min. 2022 Dec 7;15(1):30. doi: 10.1186/s13040-022-00316-8.
