Emre Barut, Jianqing Fan, Anneleen Verhasselt
Department of Statistics, George Washington University, Washington, DC 20052, USA.
Department of Operations Research & Financial Engineering, Princeton University, Princeton, NJ 08544, USA; and Special-Term Professor, School of Big Data, Fudan University, Shanghai, China.
J Am Stat Assoc. 2016;111(515):1266-1277. doi: 10.1080/01621459.2015.1092974. Epub 2016 Oct 18.
Independence screening is powerful for variable selection when the number of variables is massive. Commonly used independence screening methods are based on marginal correlations or their variants. When some prior knowledge on a certain set of important variables is available, a natural assessment of the relative importance of the other predictors is their conditional contributions to the response given the known set of variables. This results in conditional sure independence screening (CSIS). CSIS produces a rich family of alternative screening methods through different choices of the conditioning set and can help reduce the numbers of false positive and false negative selections when covariates are highly correlated. This paper proposes and studies CSIS in generalized linear models. We give conditions under which sure screening is possible and derive an upper bound on the number of selected variables. We also spell out the situations under which CSIS yields model selection consistency, as well as the properties of CSIS when a data-driven conditioning set is used. Moreover, we provide two data-driven methods for selecting the thresholding parameter of conditional screening. The utility of the procedure is illustrated by simulation studies and the analysis of two real datasets.
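To make the conditional screening idea concrete, the following is a minimal Python sketch for a generalized linear model. It is illustrative only and not the authors' implementation: the function name csis_rank, the use of statsmodels with a binomial (logistic) family, the simulated data, and the cutoff d = 10 are all assumptions for demonstration. Each candidate variable is ranked by the magnitude of its fitted coefficient when added to a GLM that already contains the known conditioning set.

```python
import numpy as np
import statsmodels.api as sm

def csis_rank(X, y, cond_idx, family=None):
    """Illustrative conditional screening sketch (not the authors' code).

    For each variable j outside the conditioning set, fit a GLM of y on
    (X[:, cond_idx], X[:, j]) and record |beta_j|, the magnitude of the
    conditional marginal coefficient. Variables are returned sorted by
    decreasing conditional contribution.
    """
    if family is None:
        family = sm.families.Binomial()  # assumed logistic GLM for illustration
    n, p = X.shape
    cond_idx = list(cond_idx)
    candidates = [j for j in range(p) if j not in cond_idx]
    scores = {}
    for j in candidates:
        design = sm.add_constant(X[:, cond_idx + [j]])  # intercept + conditioning set + candidate
        fit = sm.GLM(y, design, family=family).fit()
        scores[j] = abs(fit.params[-1])  # coefficient of the candidate variable
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: keep the top d = 10 variables given a known conditioning set {0, 1}.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = rng.integers(0, 2, size=200)
selected = csis_rank(X, y, cond_idx=[0, 1])[:10]
```

In practice the retained set would be determined by a thresholding parameter rather than a fixed cutoff; the paper discusses two data-driven ways to choose it.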