Department of Biostatistics, Columbia University Mailman School of Public Health, New York, NY, USA.
Stat Med. 2013 Sep 20;32(21):3646-59. doi: 10.1002/sim.5783. Epub 2013 Mar 25.
Multiple imputation (MI) is a commonly used technique for handling missing data in large-scale medical and public health studies. However, variable selection on multiply-imputed data remains an important and longstanding statistical problem. If a variable selection method is applied to each imputed dataset separately, it may select different variables for different imputed datasets, which makes it difficult to interpret the final model or draw scientific conclusions. In this paper, we propose a novel multiple imputation-least absolute shrinkage and selection operator (MI-LASSO) variable selection method as an extension of the least absolute shrinkage and selection operator (LASSO) method to multiply-imputed data. The MI-LASSO method treats the estimated regression coefficients of the same variable across all imputed datasets as a group and applies the group LASSO penalty to that group, so that each variable is either selected in every imputed dataset or excluded from all of them. We use a simulation study to demonstrate the advantage of the MI-LASSO method compared with the alternatives. We also apply the MI-LASSO method to the University of Michigan Dioxin Exposure Study to identify important circumstances and exposure factors that are associated with human serum dioxin concentration in Midland, Michigan.
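The grouping idea described above can be illustrated with a minimal sketch. It assumes a linear model, centered covariates and outcomes (so no intercept), and a penalized least-squares objective of the form 0.5 * sum_d ||y_d - X_d b_d||^2 + lam * sum_j ||(b_1j, ..., b_Dj)||_2, where d indexes the D imputed datasets and j indexes variables. The function name `mi_lasso`, its arguments, and the proximal-gradient solver are illustrative choices made here, not the authors' algorithm or code; the abstract does not state how the penalized problem is solved.

```python
import numpy as np

def mi_lasso(Xs, ys, lam, n_iter=500):
    """Illustrative group-LASSO fit across D imputed datasets.

    Xs : list of D arrays of shape (n, p), centered covariates (assumption)
    ys : list of D arrays of shape (n,), centered outcomes (assumption)
    lam: group penalty weight (hypothetical tuning parameter)
    Returns B of shape (D, p); column j holds the coefficients of
    variable j in each of the D imputed datasets.
    """
    D, p = len(Xs), Xs[0].shape[1]
    B = np.zeros((D, p))
    # Step size 1/L, where L bounds the curvature of the least-squares term
    # (largest squared spectral norm among the design matrices).
    L = max(np.linalg.norm(X, 2) ** 2 for X in Xs)
    step = 1.0 / L
    for _ in range(n_iter):
        # Gradient step on the least-squares loss, one row per dataset.
        G = np.stack([X.T @ (X @ b - y) for X, y, b in zip(Xs, ys, B)])
        B = B - step * G
        # Group soft-thresholding: shrink each column (one variable across
        # all datasets) toward zero as a unit; small groups become exactly 0.
        norms = np.linalg.norm(B, axis=0)
        shrink = np.maximum(0.0, 1.0 - step * lam / np.maximum(norms, 1e-12))
        B = B * shrink
    return B
```

Because the shrinkage acts on a whole column of `B` at once, a variable's coefficients are set to zero in all imputed datasets together, which is the property that avoids a different selected set per imputed dataset. In practice `lam` would be chosen by a data-driven criterion; that choice is outside the scope of this sketch.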