Xiong Wei, Chen Yaxian, Ma Shuangge
School of Statistics, University of International Business and Economics, Beijing 100872, PR China.
Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong.
Comput Stat Data Anal. 2023 Apr;180. doi: 10.1016/j.csda.2022.107684. Epub 2022 Dec 28.
In many practical high-dimensional problems, interactions are increasingly found to play important roles beyond main effects. A representative example is gene-gene interaction. Joint analysis, which analyzes all interactions and main effects in a single model, can be seriously challenged by high dimensionality. For high-dimensional data analysis in general, marginal screening has been established as effective for reducing computational cost, increasing stability, and improving estimation/selection performance. Most existing marginal screening methods are designed for the analysis of main effects only. The existing screening methods for interaction analysis are often limited by stringent model assumptions, a lack of robustness, and/or a requirement that predictors be continuous (and hence a lack of flexibility). A unified marginal screening approach tailored to interaction analysis is developed, which can be applied to regression, classification, and survival analysis. Predictors may be continuous or discrete. The proposed approach is built on Coefficient of Variation (CV) filters based on information entropy. Statistical properties are rigorously established. It is shown that the CV filters are almost insensitive to the distribution tails of predictors, the correlation structure among predictors, and the sparsity level of signals. An efficient two-stage algorithm is developed to make the proposed approach scalable to ultrahigh-dimensional data. Simulations and the analysis of TCGA LUAD data further demonstrate the practical superiority of the proposed approach.
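To make the idea of entropy-based marginal screening for interactions concrete, the following is a minimal sketch for discrete predictors and a discrete response. It does not implement the CV filters or the two-stage algorithm, whose definitions are not given in the abstract. As a stand-in, each predictor pair is scored by a plug-in estimate of the mutual information between the pair and the response, minus the two main-effect mutual informations. The function names (`entropy`, `joint`, `mutual_info`, `screen_pairs`), the score itself, and the simulated XOR data are all assumptions made for illustration.

```python
import itertools

import numpy as np


def entropy(labels):
    """Plug-in Shannon entropy (in nats) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def joint(*cols):
    """Encode several discrete columns as a single label per observation."""
    stacked = np.column_stack(cols)
    _, codes = np.unique(stacked, axis=0, return_inverse=True)
    return codes.ravel()


def mutual_info(a, b):
    """Plug-in mutual information I(a; b) = H(a) + H(b) - H(a, b)."""
    return entropy(a) + entropy(b) - entropy(joint(a, b))


def screen_pairs(X, y, top=5):
    """Rank predictor pairs by an illustrative interaction score.

    Score (hypothetical stand-in, NOT the paper's CV filter):
        I((X_j, X_k); y) - I(X_j; y) - I(X_k; y)
    Each pair is scored on its own, so no joint model is fitted.
    """
    p = X.shape[1]
    main = [mutual_info(X[:, j], y) for j in range(p)]
    scores = {}
    for j, k in itertools.combinations(range(p), 2):
        pair_mi = mutual_info(joint(X[:, j], X[:, k]), y)
        scores[(j, k)] = pair_mi - main[j] - main[k]
    return sorted(scores, key=scores.get, reverse=True)[:top]


if __name__ == "__main__":
    # Simulated data: the response is a pure interaction (XOR) of
    # predictors 0 and 1, so neither has a marginal main effect.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(2000, 6))
    y = X[:, 0] ^ X[:, 1]
    print(screen_pairs(X, y, top=3))
```

The exhaustive loop visits all p(p-1)/2 pairs, which is exactly the cost that becomes prohibitive in ultrahigh dimensions and that a staged procedure, such as the two-stage algorithm mentioned in the abstract, is meant to avoid. Continuous predictors would need discretization or a different entropy estimator before a score of this kind could be applied.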