Park Heewon, Niida Atsushi, Imoto Seiya, Miyano Satoru
1 Faculty of Global and Science Studies, Yamaguchi University, Yamaguchi Prefecture, Japan.
2 Health Intelligence Center, Institute of Medical Science, University of Tokyo, Tokyo, Japan.
J Comput Biol. 2017 Feb;24(2):138-152. doi: 10.1089/cmb.2016.0140. Epub 2016 Oct 19.
Driver gene selection is crucial for understanding the heterogeneous system of cancer. To identify cancer driver genes, various statistical strategies have been proposed; in particular, L1-type regularization methods have drawn a large amount of attention. However, these statistical approaches have been developed purely from an algorithmic and statistical point of view, and existing studies have applied them to genomic data analysis without considering biological knowledge. We consider a statistical strategy that incorporates biological knowledge to identify cancer driver genes. Copy number alterations have been considered to drive cancer pathogenesis, and regions of strong interaction between copy number alterations and expression levels are known as a tumor-related symptom. We incorporate the influence of copy number alterations on expression levels into the cancer driver gene-selection process. To quantify the dependence between copy number alterations and expression levels, we consider [Formula: see text] and [Formula: see text] effects of copy number alterations on the expression levels of genes, and incorporate this symptom of tumor pathogenesis into the gene-selection procedure. We then propose an interaction-based feature-selection strategy based on adaptive L1-type regularization and random lasso procedures. The proposed method imposes a large penalty on genes for which the dependency between the two features is low, so the coefficients of those genes are estimated to be small or exactly 0. This implies that the proposed method can provide biologically relevant results in cancer driver gene selection. Monte Carlo simulations and an analysis of The Cancer Genome Atlas (TCGA) data show that the proposed strategy is effective for high-dimensional genomic data analysis. Furthermore, the proposed method provides reliable and biologically relevant results for cancer driver gene selection in the TCGA data analysis.
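The penalty-weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the dependency between copy number and expression is measured by the absolute Pearson correlation (a stand-in for the effects elided as "[Formula: see text]"), that the weight is the reciprocal of that correlation, and it omits the random lasso (bootstrap) step entirely. All function names, the simulated data, and the values of `lam` and the 0.01 offset are hypothetical.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by L1-penalized coordinate descent."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5

def weighted_lasso(X, y, weights, lam, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j w_j * |b_j|.

    No intercept is fitted; columns are assumed to be roughly centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of column j with the partial residual (column j excluded)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

# Simulated data (assumption): expression of genes 0-2 follows copy number,
# expression of genes 3-5 is independent of copy number.
random.seed(0)
n, p = 80, 6
copy_number = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
expression = [
    [(copy_number[i][j] if j < 3 else 0.0) + (0.3 if j < 3 else 1.0) * random.gauss(0, 1)
     for j in range(p)]
    for i in range(n)
]
# Only genes 0 and 1 affect the response.
y = [2.0 * e[0] - 1.5 * e[1] + 0.5 * random.gauss(0, 1) for e in expression]

# Low copy-number/expression dependency -> large weight -> heavier penalty.
weights = [
    1.0 / (abs_corr([r[j] for r in copy_number], [r[j] for r in expression]) + 0.01)
    for j in range(p)
]
beta = weighted_lasso(expression, y, weights, lam=0.1)
for j in range(p):
    print(f"gene {j}: weight = {weights[j]:.2f}, coefficient = {beta[j]:.3f}")
```

Because each gene's threshold is `lam * weights[j]`, a gene whose expression is weakly tied to its copy number needs a much stronger association with the response to stay in the model, which is the behaviour the abstract describes.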