Clinical Epidemiology and Biostatistics Unit, Murdoch Children's Research Institute, Parkville, Victoria, Australia.
Department of Paediatrics, The University of Melbourne, Parkville, Victoria, Australia.
Biom J. 2024 Jan;66(1):e2200291. doi: 10.1002/bimj.202200291.
Multiple imputation (MI) is a popular method for handling missing data. Auxiliary variables can be added to the imputation model(s) to improve MI estimates. However, the choice of which auxiliary variables to include is not always straightforward. Several data-driven auxiliary variable selection strategies have been proposed, but there has been limited evaluation of their performance. Using a simulation study, we evaluated the performance of eight auxiliary variable selection strategies: (1, 2) two versions of selection based on correlations in the observed data; (3) selection using hypothesis tests of the "missing completely at random" assumption; (4) replacing auxiliary variables with their principal components; (5, 6) forward and forward stepwise selection; (7) forward selection based on the estimated fraction of missing information; and (8) selection via the least absolute shrinkage and selection operator (LASSO). A complete case analysis and an MI analysis using all auxiliary variables (the "full model") were included for comparison. We also applied all strategies to a motivating case study. The full model outperformed all auxiliary variable selection strategies in the simulation study, with the LASSO strategy the best-performing auxiliary variable selection strategy overall. All MI analysis strategies that we were able to apply to the case study led to similar estimates, although computational time was substantially reduced when variable selection was employed. This study provides further support for adopting an inclusive auxiliary variable strategy where possible. Auxiliary variable selection using the LASSO may be a promising alternative when the full model fails or is too burdensome.
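As a rough illustration of what a correlation-based selection strategy might look like, the sketch below keeps a candidate auxiliary variable only if its absolute Pearson correlation with the incomplete variable, computed on pairwise-complete cases, reaches a cutoff. This is not the authors' implementation: the function names (`pearson`, `select_auxiliaries`), the 0.4 cutoff, the minimum of three complete pairs, and the toy data are all assumptions made for illustration, and the abstract does not specify how either of its two correlation-based versions is defined.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no linear information
    return sxy / sqrt(sxx * syy)


def select_auxiliaries(data, target, candidates, threshold=0.4):
    """Keep candidates whose |r| with `target` is at least `threshold`.

    `data` maps column names to lists in which None marks a missing value.
    The threshold of 0.4 is an assumed, illustrative cutoff.
    """
    selected = []
    for name in candidates:
        # Use only rows where both the candidate and the target are observed.
        pairs = [(a, t) for a, t in zip(data[name], data[target])
                 if a is not None and t is not None]
        if len(pairs) < 3:
            continue  # too few complete pairs to estimate a correlation
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if abs(r) >= threshold:
            selected.append(name)
    return selected


# Hypothetical toy data: "y" is the incomplete variable.
data = {
    "y": [1.0, 2.0, None, 4.0, 5.0, None],
    "aux_strong": [1.1, 1.9, 3.2, 4.1, 4.8, 6.0],
    "aux_weak": [3.0, 1.0, 2.0, 1.0, 3.0, 2.0],
}
chosen = select_auxiliaries(data, "y", ["aux_strong", "aux_weak"])
print(chosen)
```

In this toy example only `aux_strong` passes the cutoff, so it would be the only auxiliary variable carried into the imputation model. A rule of this kind only looks at how well a candidate predicts the incomplete variable; it says nothing about whether the candidate predicts missingness itself, which a fuller strategy might also consider.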