Wang Luheng, Liu Jingyuan, Li Yong, Li Runze
School of Mathematics, Beijing Normal University, Beijing 100875, P.R. China.
Department of Statistics, School of Economics, Wang Yanan Institute for Studies in Economics and Fujian Key Laboratory of Statistical Science, Xiamen University, Xiamen 361005, China.
Sci China Math. 2017 Mar;60(3):551-568. doi: 10.1007/s11425-016-0186-8. Epub 2016 Dec 29.
Feature screening plays an important role in ultrahigh dimensional data analysis. This paper is concerned with conditional feature screening, where the goal is to detect the association between the response and ultrahigh dimensional predictors (e.g., genetic markers) given a low-dimensional exposure variable (such as clinical or environmental variables). To this end, we first propose a new index to measure conditional independence, and then develop a conditional screening procedure based on this index. We systematically study the theoretical properties of the proposed procedure and establish its sure screening and ranking consistency properties under very mild conditions. The proposed screening procedure has several appealing properties: (a) it is model-free, in that its implementation does not require a specification of the model structure; (b) it is robust to heavy-tailed distributions and outliers in both the response and the predictors; and (c) it can handle both unconditional feature screening and conditional screening in a unified way. We study the finite sample performance of the proposed procedure by Monte Carlo simulations and further illustrate the method through two real data examples.
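To make the general shape of a conditional screening procedure concrete, the following is a minimal sketch in Python. It does not implement the index proposed in the paper, which the abstract does not define. As a stand-in, it uses a partial rank correlation: the response, the predictors, and the exposure variable are replaced by their ranks (for robustness to heavy tails), the exposure ranks are regressed out, and predictors are ordered by the absolute correlation of the residuals. The function names, the stand-in utility, and the cutoff d = floor(n / log n) used in the demo are all assumptions made for illustration.

```python
import numpy as np


def rank_transform(a):
    """Replace each column by its ranks 0..n-1 (assumes no ties, e.g. continuous data)."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    return np.argsort(np.argsort(a, axis=0), axis=0).astype(float)


def residualize(target, design):
    """Residuals from least-squares regression of each target column on design plus intercept."""
    design = np.column_stack([np.ones(design.shape[0]), design])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


def conditional_rank_screen(X, y, z, d):
    """Hypothetical stand-in utility: |partial rank correlation| of each X_j with y given z.

    Returns the indices of the d top-ranked predictors and the full utility vector.
    """
    rx, ry, rz = rank_transform(X), rank_transform(y), rank_transform(z)
    ex = residualize(rx, rz)          # predictor ranks with exposure removed
    ey = residualize(ry, rz)[:, 0]    # response ranks with exposure removed
    num = ex.T @ ey
    den = np.linalg.norm(ex, axis=0) * np.linalg.norm(ey)
    utility = np.abs(num / den)
    keep = np.argsort(-utility)[:d]   # largest utilities first
    return keep, utility


# Simulated demo: p >> n, every predictor is correlated with the exposure z,
# only the first two predictors are active, and the noise is heavy-tailed.
rng = np.random.default_rng(0)
n, p = 200, 1000
z = rng.standard_normal(n)
X = rng.standard_normal((n, p)) + z[:, None]
y = 2 * X[:, 0] - 2 * X[:, 1] + z + rng.standard_t(df=2, size=n)

d = int(n / np.log(n))  # assumed cutoff, chosen only for this illustration
keep, utility = conditional_rank_screen(X, y, z, d)
print(sorted(keep[:5]))
```

Passing a constant z makes the residualization step remove only the mean, so the same function reduces to an unconditional rank-correlation screen. This mirrors, in a very loose way, the idea of handling both settings with one procedure, but it captures only linear dependence between ranks and should not be read as model-free in the sense claimed for the paper's index.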