Zhao Yize, Long Qi
Department of Healthcare Policy and Research, Weill Cornell Medical College, Cornell University.
Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania.
Wiley Interdiscip Rev Comput Stat. 2017 Sep-Oct;9(5). doi: 10.1002/wics.1402. Epub 2017 May 24.
Variable selection plays an essential role in regression analysis because it identifies important variables that are associated with outcomes and is known to improve the predictive accuracy of the resulting models. Variable selection methods have been widely investigated for fully observed data. In the presence of missing data, however, methods for variable selection need to be carefully designed to account for both the missing data mechanism and the statistical technique used to handle the missing data. Since imputation is arguably the most popular method for handling missing data due to its ease of use, statistical methods for variable selection that are combined with imputation are of particular interest. These methods, which are valid under the assumption that data are missing at random (MAR) or missing completely at random (MCAR), largely fall into three general strategies. The first strategy applies existing variable selection methods to each imputed dataset and then combines the variable selection results across all imputed datasets. The second strategy applies existing variable selection methods to stacked imputed datasets. The third strategy combines resampling techniques such as the bootstrap with imputation. Despite recent advances, this area remains under-developed and offers fertile ground for further research.
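The three strategies named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the review: the function names are hypothetical, the imputation step is a crude hot-deck draw from the observed values in each column (plausible only under MCAR, and it ignores the outcome, which a real multiple imputation model would not), the selector is simple marginal correlation screening standing in for any existing variable selection method, and the rule for combining results (keep variables selected in at least half of the runs) is one possible choice among several.

```python
import random

def make_toy_data(n=300, p=4, miss_rate=0.2, seed=1):
    """Toy data: y depends on columns 0 and 1 only; covariates are MCAR-missing (None)."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(p)]
        y.append(2 * x[0] - 2 * x[1] + rng.gauss(0, 1))
        rows.append([None if rng.random() < miss_rate else v for v in x])
    return rows, y

def impute_once(rows, rng):
    """Hot-deck stand-in for an imputation model: fill None with a random observed value."""
    p = len(rows[0])
    pools = [[r[j] for r in rows if r[j] is not None] or [0.0] for j in range(p)]
    return [[r[j] if r[j] is not None else rng.choice(pools[j]) for j in range(p)]
            for r in rows]

def corr(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def select(rows, y, threshold=0.3):
    """Placeholder selector: keep columns whose |correlation with y| exceeds threshold."""
    p = len(rows[0])
    return {j for j in range(p) if abs(corr([r[j] for r in rows], y)) > threshold}

def frequent(counts, total, min_share):
    """Keep variables selected in at least min_share of the runs."""
    return {j for j, c in enumerate(counts) if c / total >= min_share}

def strategy_per_imputation(rows, y, m=10, min_share=0.5, seed=0):
    """Strategy 1: select on each imputed dataset, then combine the selections."""
    rng = random.Random(seed)
    counts = [0] * len(rows[0])
    for _ in range(m):
        for j in select(impute_once(rows, rng), y):
            counts[j] += 1
    return frequent(counts, m, min_share)

def strategy_stacked(rows, y, m=10, seed=0):
    """Strategy 2: stack the m imputed datasets and run the selector once."""
    rng = random.Random(seed)
    stacked_rows, stacked_y = [], []
    for _ in range(m):
        stacked_rows.extend(impute_once(rows, rng))
        stacked_y.extend(y)
    return select(stacked_rows, stacked_y)

def strategy_bootstrap(rows, y, b=50, min_share=0.5, seed=0):
    """Strategy 3: bootstrap the incomplete data, impute each resample, then select."""
    rng = random.Random(seed)
    n = len(rows)
    counts = [0] * len(rows[0])
    for _ in range(b):
        idx = [rng.randrange(n) for _ in range(n)]
        boot_rows = [rows[i] for i in idx]
        boot_y = [y[i] for i in idx]
        for j in select(impute_once(boot_rows, rng), boot_y):
            counts[j] += 1
    return frequent(counts, b, min_share)
```

Calling `make_toy_data()` and passing the result to each of the three functions returns a set of selected column indices. In practice the hot-deck step would be replaced by a proper multiple imputation procedure and the screening step by a penalized or stepwise selector; the control flow of the three strategies is the part this sketch is meant to show.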