Department of Biostatistics, School of Public Health, Peking University Health Science Center, Beijing, China.
Beijing International Center for Mathematical Research, Peking University, Beijing, China.
Stat Methods Med Res. 2023 Jan;32(1):22-40. doi: 10.1177/09622802221129043. Epub 2022 Sep 29.
Ultra-high dimensional data, such as gene and neuroimaging data, are becoming increasingly important in biomedical science. Identifying important biomarkers among a huge number of features can provide better insight for further research. Variable screening is an efficient tool for this purpose in large-scale settings: it reduces the feature dimension to a moderate size by removing the majority of inactive features. Developing novel variable screening methods for high-dimensional features with group structures is challenging, especially when the groups overlap. For example, a very large set of genes can usually be partitioned into hundreds of pathways according to background knowledge. A primary characteristic of this type of data is that many genes appear in more than one pathway, which means that different pathways overlap. However, existing variable screening methods can only handle disjoint group structures. To fill this gap, we propose a novel variable screening method for the generalized linear model that incorporates overlapping partition structures and comes with a theoretical guarantee. In addition to establishing the sure screening property, we evaluate the performance of the proposed method through a series of numerical studies and apply it to the statistical analysis of a breast cancer data set.
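To make the idea of screening with overlapping groups concrete, here is a minimal Python sketch. It is not the method proposed in the paper, whose details the abstract does not give. It assumes a generic marginal screening rule: each feature is scored by its absolute marginal correlation with the response (a stand-in for a GLM marginal utility), each group is scored by the mean utility of its members, and the features of the top-ranked groups are kept. The function names, the group score, and the simulated data are all hypothetical.

```python
import numpy as np


def marginal_utilities(X, y):
    """Absolute marginal correlation of each column of X with y (assumed utility)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    return np.abs(Xc.T @ yc) / denom


def screen_overlapping_groups(X, y, groups, n_keep):
    """Rank groups by the mean utility of their members and keep the top n_keep.

    `groups` is a list of index lists; an index may appear in several groups.
    Returns the indices of the kept groups and the sorted union of their features.
    """
    u = marginal_utilities(X, y)
    scores = np.array([u[np.asarray(g)].mean() for g in groups])
    top = np.argsort(scores)[::-1][:n_keep]
    kept = sorted(set().union(*(groups[i] for i in top)))
    return top, kept


# Simulated example (hypothetical data): features 5, 6, 7 are active and
# belong to both of the first two groups, which overlap on indices 5..9.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 5] + 2.0 * X[:, 6] + 2.0 * X[:, 7] + rng.standard_normal(n)

groups = [
    list(range(0, 10)),
    list(range(5, 15)),   # overlaps with the first group
    list(range(15, 25)),
    list(range(20, 30)),  # overlaps with the third group
    list(range(30, 50)),
]

top_groups, kept_features = screen_overlapping_groups(X, y, groups, n_keep=2)
print("kept groups:", sorted(int(g) for g in top_groups))
print("kept features:", kept_features)
```

Because a shared feature contributes to the score of every group that contains it, the union of kept features is taken as a set so that overlapping members are not counted twice. A method with a sure screening guarantee would need a utility and threshold justified by theory, which this sketch does not attempt.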