Li Runze, Zhong Wei, Zhu Liping
The Pennsylvania State University, Xiamen University & Shanghai University of Finance and Economics.
J Am Stat Assoc. 2012 Jul 1;107(499):1129-1139. doi: 10.1080/01621459.2012.695654.
This paper is concerned with screening features in ultrahigh dimensional data analysis, which has become increasingly important in diverse scientific fields. We develop a sure independence screening procedure based on the distance correlation (DC-SIS, for short). The DC-SIS can be implemented as easily as the sure independence screening procedure based on the Pearson correlation (SIS, for short) proposed by Fan and Lv (2008). However, the DC-SIS can significantly improve the SIS. Fan and Lv (2008) established the sure screening property for the SIS based on linear models, but the sure screening property is valid for the DC-SIS under more general settings including linear models. Furthermore, the implementation of the DC-SIS does not require model specification (e.g., linear model or generalized linear model) for responses or predictors. This is a very appealing property in ultrahigh dimensional data analysis. Moreover, the DC-SIS can be used directly to screen grouped predictor variables and for multivariate response variables. We establish the sure screening property for the DC-SIS, and conduct simulations to examine its finite sample performance. Numerical comparison indicates that the DC-SIS performs much better than the SIS in various models. We also illustrate the DC-SIS through a real data example.
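To make the procedure concrete, the following is a minimal sketch of distance-correlation screening, not the authors' implementation. It assumes the usual sample distance correlation built from double-centered pairwise Euclidean distance matrices, and it keeps the `d` predictors with the largest scores. The function names (`distance_correlation`, `dc_sis`) and the cutoff parameter `d` are hypothetical choices made for illustration; the abstract itself does not specify a cutoff rule.

```python
import numpy as np


def _centered_dist(z):
    """Double-centered pairwise Euclidean distance matrix of the rows of z."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]  # treat a 1-D input as n observations of one variable
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    row_mean = dist.mean(axis=1, keepdims=True)
    col_mean = dist.mean(axis=0, keepdims=True)
    return dist - row_mean - col_mean + dist.mean()


def distance_correlation(x, y):
    """Sample distance correlation between x and y (each may be multivariate)."""
    a = _centered_dist(x)
    b = _centered_dist(y)
    dcov2_xy = (a * b).mean()
    dvar2_prod = (a * a).mean() * (b * b).mean()
    if dvar2_prod <= 0:
        return 0.0  # a constant input carries no dependence information
    # max(...) guards against tiny negative values from floating-point error
    return float(np.sqrt(max(dcov2_xy, 0.0) / np.sqrt(dvar2_prod)))


def dc_sis(X, y, d):
    """Rank the columns of X by distance correlation with y and keep the top d.

    `d` is an illustrative, user-chosen cutoff (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    scores = np.array(
        [distance_correlation(X[:, k], y) for k in range(X.shape[1])]
    )
    keep = np.argsort(-scores)[:d]
    return keep, scores


# Illustrative use on synthetic data: y depends on the first predictor only
# through its square, a relationship that Pearson correlation tends to miss.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 50))
y_demo = X_demo[:, 0] ** 2 + 0.1 * rng.standard_normal(200)
kept, dc_scores = dc_sis(X_demo, y_demo, d=5)
```

No model for the response is fitted at any point, which matches the model-free property described in the abstract. Because `_centered_dist` accepts two-dimensional inputs, passing a block of columns as `x` or a matrix as `y` gives one way to score grouped predictors or multivariate responses under the same assumptions.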