Florida State University, Department of Statistics, FL, USA.
Neuroimage. 2011 Apr 15;55(4):1519-27. doi: 10.1016/j.neuroimage.2010.12.028. Epub 2010 Dec 15.
The goals of this paper are to review the most popular methods of predictor selection in regression models, to explain why some fail when the number P of explanatory variables exceeds the number N of participants, and to discuss alternative statistical methods that can be employed in this case. We focus on penalized least squares methods in regression models, and discuss in detail two such methods that are well established in the statistical literature, the LASSO and the Elastic Net. We introduce bootstrap enhancements of these methods, the BE-LASSO and BE-Enet, that allow the user to attach a measure of uncertainty to each variable selected. Our work is motivated by a multimodal neuroimaging dataset that consists of morphometric measures (volumes at several anatomical regions of interest), white matter integrity measures from diffusion-weighted data (fractional anisotropy, mean diffusivity, axial diffusivity and radial diffusivity) and clinical and demographic variables (age, education, alcohol and drug history). In this dataset, the number P of explanatory variables exceeds the number N of participants. We use the BE-LASSO and BE-Enet to provide the first statistical analysis that allows the assessment of neurocognitive performance from high-dimensional neuroimaging and clinical predictors, including their interactions. The major novelty of this analysis is that biomarker selection and dimension reduction are accomplished with a view towards obtaining good predictions for the outcome of interest (i.e., the neurocognitive indices), unlike principal component analysis, which is performed only on the predictor space, independently of the outcome of interest.
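The abstract does not spell out how the bootstrap enhancement is computed, so the following is only a minimal sketch of the general idea, under stated assumptions: a LASSO is refit on bootstrap resamples of the participants, and the fraction of resamples in which each predictor receives a nonzero coefficient serves as its uncertainty measure. The function names (`soft_threshold`, `lasso_cd`, `bootstrap_selection_frequency`), the coordinate descent solver, the fixed penalty `lam`, and the synthetic data with P > N are all illustrative choices, not the authors' implementation or data.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100, tol=1e-6):
    """Minimise (1/2N)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes X and y have already been centred (no intercept is fitted).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()  # equals y - X @ beta while beta is zero
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column after centring: leave at zero
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= X[:, j] * delta
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta

def bootstrap_selection_frequency(X, y, lam, n_boot=50, seed=0):
    """Fraction of bootstrap resamples in which each predictor is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample participants with replacement
        Xb = X[idx] - X[idx].mean(axis=0)
        yb = y[idx] - y[idx].mean()
        counts += lasso_cd(Xb, yb, lam) != 0
    return counts / n_boot

# Synthetic illustration with P > N (hypothetical data, not the study's).
rng = np.random.default_rng(0)
N, P = 30, 60
X = rng.standard_normal((N, P))
beta_true = np.zeros(P)
beta_true[:3] = 3.0  # only the first three predictors matter
y = X @ beta_true + 0.5 * rng.standard_normal(N)

freq = bootstrap_selection_frequency(X, y, lam=0.5)
top = np.argsort(freq)[::-1][:5]
for j in top:
    print(f"predictor {j}: selected in {freq[j]:.0%} of resamples")
```

In practice the penalty would usually be tuned (for example by cross-validation) rather than fixed, and an Elastic Net variant would add a ridge term to the same update; both are left out here to keep the sketch short.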