Kampf Jürgen, Dykun Iryna, Rassaf Tienush, Mahabadi Amir Abbas
Department of Cardiology and Vascular Medicine, University Hospital of Essen, Essen, Germany.
PLoS One. 2025 May 12;20(5):e0319784. doi: 10.1371/journal.pone.0319784. eCollection 2025.
Many datasets in medicine and other branches of science are incomplete. In this article we compare various imputation algorithms for missing data.
We take the point of view that it has already been decided that the imputation should be carried out using multiple imputation by chained equations, and that the only decision left is the choice of subroutine for the one-dimensional imputations. The subroutines to be compared are predictive mean matching, weighted predictive mean matching, sampling, classification and regression trees, and random forests.
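To illustrate what a one-dimensional imputation subroutine does, the following is a minimal sketch of predictive mean matching for a single incomplete variable. It is a simplified variant and not the implementation evaluated in the study: it fits an ordinary least-squares model without drawing the regression parameters from their posterior, and the function name `pmm_impute`, the donor count `k=5`, and the NaN coding of missing values are assumptions made for this example.

```python
import numpy as np

def pmm_impute(x, y, k=5, rng=None):
    """Impute NaN entries of y from predictors x by simplified predictive mean matching.

    Assumption: missing values in y are coded as NaN and x is fully observed.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed = ~missing

    # Linear model with intercept, fitted on the observed cases only.
    design = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    predicted = design @ beta

    donor_pred = predicted[observed]   # predicted means of the donors
    donor_val = y[observed]            # observed values of the donors
    k_eff = min(k, donor_val.size)

    imputed = y.copy()
    for i in np.flatnonzero(missing):
        # The k donors whose predicted mean is closest to that of case i.
        nearest = np.argsort(np.abs(donor_pred - predicted[i]))[:k_eff]
        # Impute the observed value of one randomly chosen donor.
        imputed[i] = donor_val[rng.choice(nearest)]
    return imputed
```

Because every imputed value is copied from an observed case, the imputations always lie within the range of the observed data. In a chained-equations procedure, a subroutine of this kind would be applied to each incomplete variable in turn and the cycle repeated several times.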
We compare these subroutines on real data and on simulated data. We consider the estimation of expected values, variances, and coefficients of linear regression models, logistic regression models, and Cox regression models. As real data, we use survival times after the diagnosis of obstructive coronary artery disease, with systolic blood pressure, LDL, diabetes, smoking behavior, and family history of premature heart disease as the variables for which values have to be imputed. Although we are mainly interested in statistical properties such as bias, mean squared error, and coverage probability of confidence intervals, we also consider computation time.
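The statistical properties named above can be computed from repeated simulation runs roughly as follows. This is only a sketch under stated assumptions: the function name `simulation_metrics` is hypothetical, the confidence intervals are assumed to be symmetric normal-approximation intervals with quantile `z=1.96`, and the input values in the usage note are made up for illustration rather than taken from the study.

```python
import numpy as np

def simulation_metrics(estimates, std_errors, true_value, z=1.96):
    """Bias, mean squared error, and CI coverage over repeated simulation runs.

    Assumption: each run yields a point estimate and a standard error, and the
    confidence interval is estimate +/- z * standard error.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)

    bias = estimates.mean() - true_value
    mse = np.mean((estimates - true_value) ** 2)

    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    # Fraction of runs whose interval contains the true value.
    coverage = np.mean((lower <= true_value) & (true_value <= upper))

    return {"bias": bias, "mse": mse, "coverage": coverage}
```

For example, `simulation_metrics([0.9, 1.1, 1.5], [0.1, 0.1, 0.1], true_value=1.0)` gives a mean squared error of 0.09 and a coverage of 2/3, because the interval of the third run does not contain the true value.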
Weighted predictive mean matching had to be excluded from the statistical comparison due to its enormous computation time. Among the remaining algorithms, predictive mean matching performed best in most of the situations we tested.
This is by far the largest comparison study for subroutines of multiple imputation by chained equations that has been performed up to now.