Zhou Tingyou, Zhu Liping, Xu Chen, Li Runze
School of Data Sciences, Zhejiang University of Finance and Economics, Hangzhou, P. R. China.
Institute of Statistics and Big Data and Center for Applied Statistics, Renmin University of China, Beijing, P. R. China.
J Am Stat Assoc. 2020;115(531):1393-1405. doi: 10.1080/01621459.2019.1632078. Epub 2019 Jul 22.
Feature screening plays an important role in the analysis of ultrahigh dimensional data. Because of complicated model structures and high noise levels, existing screening methods often suffer from model misspecification and the presence of outliers. To address these issues, we introduce a new metric named cumulative divergence (CD) and develop a CD-based forward screening procedure. This forward screening method is model-free and robust to outliers in the response. It also incorporates the joint effects among covariates into the screening process. With a data-driven threshold, the new method can automatically determine the number of features to retain after screening. These merits make CD-based screening very appealing in practice. Under certain regularity conditions, we show that the proposed method possesses the sure screening property. The performance of our proposal is illustrated through simulations and a real data example.