Carter Knute D, Cavanaugh Joseph E
Department of Biostatistics, University of Iowa, Iowa City, IA, USA.
J Appl Stat. 2019 Jul 25;47(13-15):2384-2420. doi: 10.1080/02664763.2019.1645097. eCollection 2020.
A common model selection approach is to select the best model, according to some criterion, from among the collection of models defined by all possible subsets of the explanatory variables. Identifying an optimal subset has proven to be a challenging problem, both statistically and computationally. Our model selection procedure allows the researcher to nominate, a priori, the probability at which models containing false or spurious variables will be selected from among all possible subsets. The procedure determines whether inclusion of each candidate variable results in a sufficiently improved fitting term - and is hence named the SIFT procedure. Two variants are proposed: a naive method based on a set of restrictive assumptions and an empirical permutation-based method. Properties of these methods are investigated within the standard linear modeling framework and performance is evaluated against other model selection techniques. The SIFT procedure behaves as designed - asymptotically selecting variables that characterize the underlying data generating mechanism, while limiting selection of spurious variables to the desired level. The SIFT methodology offers researchers a promising new approach to model selection, providing the ability to control the probability of selecting a model that includes spurious variables to a level based on the context of the application.
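The abstract does not give the details of the SIFT algorithm, so the following is only a minimal sketch of the general idea behind a permutation-based check of whether a candidate variable "sufficiently improves" the fit of a linear model. It is not the authors' implementation. The function names, the use of the change in n·log(RSS) as the measure of improvement in the fitting term, and the choice of the (1 - alpha) permutation quantile as the threshold are all assumptions made for illustration.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def fit_gain(X_base, x_new, y):
    """Improvement in the Gaussian -2 log-likelihood from adding x_new.

    For a linear model with normal errors this equals n * log(RSS_base / RSS_full).
    """
    n = len(y)
    rss_full = rss(np.column_stack([X_base, x_new]), y)
    return n * np.log(rss(X_base, y) / rss_full)


def permutation_threshold(X_base, x_cand, y, alpha=0.05, n_perm=500, rng=None):
    """Hypothetical helper: (1 - alpha) quantile of the gain under permutation.

    Permuting x_cand breaks any link with y, so each permuted copy behaves
    like a spurious variable with the same marginal distribution.
    """
    rng = np.random.default_rng(rng)
    gains = np.empty(n_perm)
    for b in range(n_perm):
        gains[b] = fit_gain(X_base, rng.permutation(x_cand), y)
    return float(np.quantile(gains, 1 - alpha))


def sufficiently_improves(X_base, x_cand, y, alpha=0.05, n_perm=500, rng=None):
    """Return (decision, observed gain, permutation threshold)."""
    observed = float(fit_gain(X_base, x_cand, y))
    threshold = permutation_threshold(X_base, x_cand, y, alpha, n_perm, rng)
    return observed > threshold, observed, threshold


# Simulated example: x1 drives the response, x2 is spurious.
gen = np.random.default_rng(0)
n = 200
x1 = gen.normal(size=n)
x2 = gen.normal(size=n)
y = 1.0 + 2.0 * x1 + gen.normal(size=n)
intercept = np.ones((n, 1))

keep_x1, gain_x1, thr_x1 = sufficiently_improves(intercept, x1, y, rng=1)
keep_x2, gain_x2, thr_x2 = sufficiently_improves(intercept, x2, y, rng=2)
print("x1:", keep_x1, round(gain_x1, 2), round(thr_x1, 2))
print("x2:", keep_x2, round(gain_x2, 2), round(thr_x2, 2))
```

In this sketch, alpha plays the role of the nominated probability of admitting a spurious variable for a single candidate. Controlling that probability across all possible subsets, as the abstract describes, would need the authors' full procedure and is not attempted here.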