Nandy Debmalya, Chiaromonte Francesca, Li Runze
Department of Biostatistics & Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA.
Department of Statistics, Penn State University, University Park, PA 16802, USA.
J Am Stat Assoc. 2022;117(539):1516-1529. doi: 10.1080/01621459.2020.1864380. Epub 2021 Feb 10.
Contemporary high-throughput experimental and surveying techniques give rise to ultrahigh-dimensional supervised problems with sparse signals; that is, a limited number of observations (n), each with a very large number of covariates (p >> n), only a small share of which is truly associated with the response. In these settings, major concerns about computational burden, algorithmic stability, and statistical accuracy call for substantially reducing the feature space by eliminating redundant covariates before applying any sophisticated statistical analysis. Along the lines of sure independence screening (Fan and Lv, 2008) and other model- and correlation-based feature screening methods, we propose a model-free procedure, referred to as CIS. CIS uses a marginal utility connected to the notion of the traditional Fisher information, possesses the sure screening property, and is applicable to any type of response with continuous features, or to any type of features with a continuous response. Simulations and an application to transcriptomic data on rats reveal the comparative strengths of CIS over some popular feature screening methods.
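To make the idea of marginal feature screening concrete, the following is a minimal sketch of the correlation-based approach of Fan and Lv (2008) that the abstract cites as a precursor. It is not the CIS procedure: the abstract does not specify the Fisher-information-based utility, so absolute Pearson correlation is used here as a stand-in. The function name `marginal_correlation_screen`, the simulated data, and the cutoff d = floor(n / log n) (a commonly used choice) are assumptions made for illustration only.

```python
import numpy as np


def marginal_correlation_screen(X, y, d):
    """Keep the d features with the largest |Pearson correlation| with y.

    Illustrative stand-in for a marginal utility; NOT the CIS utility.
    Returns the sorted indices of the retained features and all utilities.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)          # center each feature column
    yc = y - y.mean()                # center the response
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf       # constant columns get utility 0
    utility = np.abs(Xc.T @ yc) / denom
    keep = np.argsort(-utility)[:d]  # indices of the d largest utilities
    return np.sort(keep), utility


# Simulated sparse-signal setting (hypothetical): n = 100, p = 2000,
# and only the first two features are associated with the response.
rng = np.random.default_rng(0)
n, p = 100, 2000
X = rng.standard_normal((n, p))
y = 3 * X[:, 0] - 3 * X[:, 1] + rng.standard_normal(n)

d = int(n / np.log(n))               # assumed cutoff, floor(n / log n)
selected, utility = marginal_correlation_screen(X, y, d)
```

Each feature is scored on its own, so the cost grows linearly in p and the reduced set in `selected` can then be passed to a more elaborate method. A model-free procedure such as CIS would replace the correlation score with its own marginal utility while keeping the same rank-and-truncate structure.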