Deforth Manja, Heinze Georg, Held Ulrike
Department of Biostatistics at the Epidemiology, Biostatistics and Prevention Institute, University of Zurich, Zurich, Switzerland.
Center for Medical Data Science, Institute of Clinical Biometrics, Medical University of Vienna, Vienna, Austria.
J Clin Epidemiol. 2024 Dec;176:111539. doi: 10.1016/j.jclinepi.2024.111539. Epub 2024 Sep 24.
The development of clinical prediction models is often impeded by missing values in the predictors. Various methods for imputing missing values before modeling have been proposed. Some are variants of multiple imputation by chained equations, while others are based on single imputation. These methods may include elements of flexible modeling or machine learning algorithms, and user-friendly software packages are available for some of them. The aim of this study was to investigate by simulation whether some of these methods consistently outperform others with respect to performance measures of clinical prediction models.
We simulated development and validation cohorts by mimicking the observed distributions of the predictors and the outcome variable in a real data set. In the development cohorts, missing predictor values were created in 36 scenarios defined by the missingness mechanism and the proportion of noncomplete cases. We applied three imputation algorithms available in R software (R Foundation for Statistical Computing, Vienna, Austria): mice, aregImpute, and missForest. These algorithms differed in their use of linear models, flexible models, or random forests, in the way they sample from the posterior predictive distribution, and in whether they generate a single or multiple imputed data sets. For multiple imputation, we also investigated the impact of the number of imputations. Logistic regression models were fitted to the simulated development cohorts before missing value generation (full data analysis), after missing value generation (complete case analysis), and to the imputed data. Prognostic model performance was measured by the scaled Brier score, c-statistic, calibration intercept and slope, and the mean absolute prediction error, all evaluated in validation cohorts without missing values. The performance of full data analysis was considered ideal.
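As a rough illustration of four of the performance measures named above, the following Python sketch computes the scaled Brier score, the c-statistic, and the calibration intercept and slope from observed binary outcomes and predicted probabilities. It is not the authors' code (the study used R); the function names are hypothetical, and the definitions are the commonly used ones, assumed here rather than taken from the paper: the Brier score is scaled against a model that predicts the outcome prevalence for everyone, ties in the c-statistic count one half, the slope comes from a logistic regression of the outcome on the logit of the prediction, and the intercept from the same regression with the slope fixed at one.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def scaled_brier(y, p):
    """1 - Brier / Brier of a null model predicting the prevalence for everyone."""
    n = len(y)
    brier = sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / n
    prev = sum(y) / n
    return 1.0 - brier / (prev * (1.0 - prev))

def c_statistic(y, p):
    """Share of event/non-event pairs ranked correctly; ties count one half."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    score = sum(1.0 if e > q else 0.5 if e == q else 0.0
                for e in events for q in nonevents)
    return score / (len(events) * len(nonevents))

def calibration_slope(y, p, iterations=50):
    """Slope b of the logistic model y ~ a + b * logit(p), via Newton-Raphson."""
    x = [logit(pi) for pi in p]
    a, b = 0.0, 1.0
    for _ in range(iterations):
        mu = [expit(a + b * xi) for xi in x]
        w = [m * (1.0 - m) for m in mu]
        # Score vector (g0, g1) and 2x2 information matrix (h00, h01, h11)
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * xi for yi, m, xi in zip(y, mu, x))
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return b

def calibration_intercept(y, p, iterations=50):
    """Intercept a of the logistic model y ~ a + logit(p), slope fixed at 1."""
    x = [logit(pi) for pi in p]
    a = 0.0
    for _ in range(iterations):
        mu = [expit(a + xi) for xi in x]
        a += sum(yi - m for yi, m in zip(y, mu)) / sum(m * (1.0 - m) for m in mu)
    return a

# Toy validation data (made up for illustration): 10 subjects predicted at 0.2
# with 2 events, and 10 subjects predicted at 0.8 with 8 events.
y = [1] * 2 + [0] * 8 + [1] * 8 + [0] * 2
p_calibrated = [0.2] * 10 + [0.8] * 10
p_overconfident = [0.1] * 10 + [0.9] * 10  # too extreme for the same outcomes

print(scaled_brier(y, p_calibrated), c_statistic(y, p_calibrated))
print(calibration_intercept(y, p_calibrated), calibration_slope(y, p_calibrated))
print(calibration_slope(y, p_overconfident))  # below 1: predictions too extreme
```

In the toy data, the first set of predictions matches the observed event proportions, so the slope is close to one and the intercept close to zero, whereas the more extreme predictions yield a slope below one. A slope below one is the typical signature of an overfitted model, which is why it serves as a sensitive indicator when comparing imputation methods.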
None of the imputation methods achieved the predictive accuracy that the model would have attained without missingness. In general, complete case analysis yielded the worst performance, and the deviation from ideal performance grew with an increasing percentage of missingness and decreasing sample size. Across all scenarios and performance measures, aregImpute and mice, both with 100 imputations, resulted in the highest predictive accuracy. Surprisingly, aregImpute outperformed full data analysis in achieving calibration slopes very close to one across all scenarios and outcome models. The gain in performance of mice with 100 compared with five imputations was only marginal. The differences between the imputation methods decreased with increasing sample size and decreasing proportion of noncomplete cases.
In our simulation study, model calibration was more affected by the choice of imputation method than model discrimination. While differences in model performance between imputation methods were generally small, multiple imputation methods such as mice and aregImpute, which can handle linear or nonlinear associations between predictors and outcome, are an attractive and reliable choice in most situations.