Kemmo Tsafack Ulrich, Lin Chien-Wei, Ahn Kwang Woo
Division of Biostatistics, Medical College of Wisconsin (MCW), Milwaukee, WI 53226, USA.
Bioengineering (Basel). 2024 Nov 25;11(12):1193. doi: 10.3390/bioengineering11121193.
Investigators often face ultra-high dimensional multi-omics data, where the goal is to identify important genes and the important omics within each gene. In such data, each gene forms a group consisting of its multiple omics, and some genes may also be highly correlated with one another. This leads to tri-level hierarchically structured data: the cluster level, which is a group of correlated genes; the subgroup level, which is the group of omics belonging to the same gene; and the individual level, which consists of the omics themselves. Screening is widely used to remove unimportant variables so that the number of remaining variables becomes smaller than the sample size. Penalized regression is then applied to the variables that remain after screening to identify the important ones. To screen out unimportant genes, we propose to cluster the genes and then conduct screening. We show that the proposed screening method possesses the sure screening property. Extensive simulations show that the proposed screening method outperforms competing methods. We apply the proposed variable selection method to the TCGA breast cancer dataset to identify genes and omics that are related to breast cancer.
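The general idea of screening at the gene (group) level can be illustrated with a minimal sketch. This is not the authors' cluster-based procedure, which the abstract does not specify in detail; it is plain marginal-correlation screening in which each gene is scored by the mean squared correlation between its omics and the response. The function name `group_screen`, the simulated data, and all parameter values are assumptions made for illustration only.

```python
import numpy as np


def group_screen(X, y, gene_ids, keep):
    """Return the `keep` genes with the largest group-level screening score.

    Hypothetical helper: a gene's score is the mean squared marginal
    correlation between each of its omics columns and the response y.
    """
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize every omics column
    yc = (y - y.mean()) / y.std()              # standardize the response
    corr = Xc.T @ yc / len(y)                  # marginal correlation per column
    genes = np.unique(gene_ids)
    score = np.array([np.mean(corr[gene_ids == g] ** 2) for g in genes])
    order = np.argsort(score)[::-1]            # highest score first
    return genes[order[:keep]]


# Simulated example (assumed sizes): 100 samples, 200 genes, 3 omics per gene,
# so 600 variables in total, far more than the sample size.
rng = np.random.default_rng(0)
n, n_genes, n_omics = 100, 200, 3
gene_ids = np.repeat(np.arange(n_genes), n_omics)
X = rng.standard_normal((n, n_genes * n_omics))

# Only one omic of gene 0 (column 0) and one omic of gene 1 (column 3) matter.
y = 2.0 * X[:, 0] + 2.0 * X[:, 3] + rng.standard_normal(n)

# Keep 20 genes (60 variables), which is now fewer than the 100 samples.
selected = group_screen(X, y, gene_ids, keep=20)
print(sorted(selected.tolist()))
```

After a step like this, the retained variables number fewer than the sample size, so a penalized regression (for example, a group-structured lasso) could be fitted on them to pick out individual genes and omics. Accounting for correlated genes by clustering them first, as the abstract proposes, would require an additional step that is not shown here.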