Department of Health Sciences, Northeastern University, Boston, MA, USA.
Stat Methods Med Res. 2022 Oct;31(10):1904-1915. doi: 10.1177/09622802221104238. Epub 2022 Jun 5.
Multiple imputation techniques are commonly used when data are missing; however, there are many options to consider. Multivariate imputation by chained equations is a popular method for generating imputations, but it relies on specifying models when imputing missing values. In this work, we introduce multiple imputation by super learning, an update to the multivariate imputation by chained equations method that generates imputations with ensemble learning. Ensemble methodologies have recently gained attention for use in inference and prediction because they optimally combine a variety of user-specified parametric and non-parametric models and perform well when estimating complex functions, including those with interaction terms. Through two simulations, we compare inferences made using the multiple imputation by super learning approach with those made using other commonly used multiple imputation methods, and we demonstrate that multiple imputation by super learning is a superior option with respect to bias, confidence interval coverage rate, and confidence interval width.
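The idea of replacing each per-variable model in a chained-equations loop with an ensemble can be illustrated with a small, standard-library-only sketch. It is not the authors' algorithm or software: the two toy candidate learners, the grid-searched convex weight, the bootstrap step, the Gaussian noise draw, and every function name (`fit_mean`, `fit_linear`, `fit_super_learner`, `impute_once`) are assumptions made purely for illustration, and the simulated data are unrelated to the simulations in the paper.

```python
import random
import statistics

def fit_mean(xs, ys):
    """Toy candidate learner 1: ignore x and predict the mean of y."""
    m = statistics.fmean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Toy candidate learner 2: least-squares line y = a + b * x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    a = my - b * mx
    return lambda x: a + b * x

LEARNERS = (fit_mean, fit_linear)

def fit_super_learner(xs, ys, folds=5):
    """Combine the learners with the convex weight that minimises K-fold CV squared error."""
    n = len(xs)
    cv_preds = [[0.0] * n for _ in LEARNERS]
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        test = [i for i in range(n) if i % folds == k]
        for j, fit in enumerate(LEARNERS):
            model = fit([xs[i] for i in train], [ys[i] for i in train])
            for i in test:
                cv_preds[j][i] = model(xs[i])

    def cv_risk(w):
        # w is the weight on the linear learner, 1 - w on the mean learner
        return sum((ys[i] - (w * cv_preds[1][i] + (1 - w) * cv_preds[0][i])) ** 2
                   for i in range(n)) / n

    w = min((g / 20 for g in range(21)), key=cv_risk)   # coarse grid over [0, 1]
    models = [fit(xs, ys) for fit in LEARNERS]           # refit on all supplied rows
    resid_sd = cv_risk(w) ** 0.5                         # CV-based residual spread
    return (lambda x: w * models[1](x) + (1 - w) * models[0](x)), w, resid_sd

def impute_once(rows, n_iter=5, rng=random):
    """Return one completed copy of rows ([x, y] pairs, None marks a missing value)."""
    missing = [[r[c] is None for r in rows] for c in range(2)]
    data = [list(r) for r in rows]
    for c in range(2):                                   # start from observed means
        m = statistics.fmean([r[c] for r in rows if r[c] is not None])
        for i, r in enumerate(data):
            if missing[c][i]:
                r[c] = m
    for _ in range(n_iter):                              # chained-equations sweeps
        for target in range(2):
            other = 1 - target
            obs_idx = [i for i in range(len(data)) if not missing[target][i]]
            # Assumed device: bootstrap observed rows so imputations differ between runs
            boot = [rng.choice(obs_idx) for _ in obs_idx]
            predict, _, sd = fit_super_learner([data[i][other] for i in boot],
                                               [data[i][target] for i in boot])
            for i in range(len(data)):
                if missing[target][i]:
                    data[i][target] = predict(data[i][other]) + rng.gauss(0, sd)
    return data

# Simulated example: y depends linearly on x, values missing completely at random.
rng = random.Random(0)
rows = []
for _ in range(200):
    x = rng.gauss(0, 1)
    y = 2.0 + 1.5 * x + rng.gauss(0, 0.5)
    rows.append([x if rng.random() > 0.2 else None, y if rng.random() > 0.3 else None])

imputed_sets = [impute_once(rows, rng=rng) for _ in range(5)]
estimates = [statistics.fmean([r[1] for r in d]) for d in imputed_sets]
pooled_mean_y = statistics.fmean(estimates)              # pooled point estimate only
```

A real analysis would use a richer learner library, handle more than two variables, and pool variances as well as point estimates across the completed data sets; the sketch only shows where the ensemble sits inside the chained-equations loop.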