Fan Jianqing, Samworth Richard, Wu Yichao
Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08540 USA.
J Mach Learn Res. 2009;10:2013-2038.
Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently used techniques are based on independence screening; examples include correlation ranking (Fan and Lv, 2008) and feature selection using a two-sample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan and Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions, and that its revision, called iterative sure independence screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the widely used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated examples and two real data examples are presented to illustrate the methodology.
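To make the screening idea concrete, here is a minimal sketch (not the authors' implementation) of sure independence screening by marginal correlation ranking in the least-squares setting, followed by one residual-based iteration in the spirit of the original ISIS of Fan and Lv (2008); the paper's extension works in a pseudo-likelihood framework without explicit residuals and also allows feature deletion, which this toy sketch omits. Function names and the cutoff d are illustrative assumptions.

```python
import numpy as np

def sis(X, y, d):
    """Rank features by absolute marginal correlation with y; keep the top d."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    corr = np.abs(Xc.T @ yc) / len(y)      # marginal correlations
    return np.argsort(corr)[::-1][:d]      # indices of the d largest

def isis_step(X, y, selected, d):
    """One residual-based iteration: refit on the selected features by least
    squares, then screen the remaining features against the residual."""
    beta, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
    resid = y - X[:, selected] @ beta
    remaining = np.setdiff1d(np.arange(X.shape[1]), selected)
    new = remaining[sis(X[:, remaining], resid, d)]
    return np.concatenate([selected, new])

# Toy example with p >> n: only the first three features matter.
rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1.5 * X[:, 2] + rng.standard_normal(n)
selected = sis(X, y, d=10)
selected = isis_step(X, y, selected, d=5)
print(sorted(selected))
```

In practice the screened set would then be passed to a moderate-dimensional variable selection method (e.g., a penalized likelihood fit) rather than used directly.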