Park Heewon, Niida Atushi, Miyano Satoru, Imoto Seiya
Human Genome Center, the Institute of Medical Science, the University of Tokyo, Tokyo, Japan.
J Comput Biol. 2015 Feb;22(2):73-84. doi: 10.1089/cmb.2014.0197. Epub 2015 Jan 28.
Gene networks and graphs are crucial tools for understanding the heterogeneous system of cancer, since cancer is a disease that involves not individual genes but combinations of genes associated with oncogenic processes. A goal of genomic data analysis via gene networks is to identify both the relevant gene networks and the individual genes within the selected networks. Existing methods, however, perform only network selection, so all genes in the selected networks are included in the model. This leads to overfitting when uncovering driver genes, and the results are not biologically interpretable. To achieve both "groupwise sparsity" and "within-group sparsity" when identifying driver genes based on biological knowledge (i.e., predefined overlapping groups of features), we propose a sparse overlapping group lasso via duplicated predictors in an extended space. The proposed method effectively identifies driver genes and their interactions using known biological pathway information. Monte Carlo simulations and analysis of The Cancer Genome Atlas (TCGA) project data indicate that the proposed method is effective, in terms of feature selection and prediction accuracy, for fitting a regression model constructed with duplicated predictors in overlapping groups. In the TCGA data analysis, we uncover potential cancer driver genes via expression modules and gene networks constructed from multi-omics data, and show that the uncovered genes have strong evidence as cancer driver genes. The proposed method is a useful tool for identifying cancer driver genes and for integrative multi-omics analysis.
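The idea of duplicating predictors in an extended space and then penalizing both groups and individual coefficients can be illustrated with a minimal sketch. This is not the authors' implementation: the squared-error loss, the proximal-gradient solver, the mixing parameter `alpha`, the square-root group-size weights, and all function names (`expand_design`, `prox_sparse_group`, `fit_sparse_overlap_group_lasso`) are assumptions made for illustration only.

```python
import numpy as np


def expand_design(X, groups):
    """Duplicate columns so every (possibly overlapping) group owns its own copy."""
    cols = np.array([j for g in groups for j in g])
    bounds = np.cumsum([0] + [len(g) for g in groups])
    slices = [slice(bounds[k], bounds[k + 1]) for k in range(len(groups))]
    return X[:, cols], cols, slices


def prox_sparse_group(v, slices, t, lam, alpha):
    """Proximal step for t*lam*(alpha*L1 + (1-alpha)*weighted group L2).

    Elementwise soft-thresholding gives within-group sparsity;
    the groupwise shrinkage that follows gives groupwise sparsity.
    """
    out = np.sign(v) * np.maximum(np.abs(v) - t * lam * alpha, 0.0)
    for s in slices:
        norm = np.linalg.norm(out[s])
        thresh = t * lam * (1.0 - alpha) * np.sqrt(s.stop - s.start)
        out[s] = 0.0 if norm <= thresh else out[s] * (1.0 - thresh / norm)
    return out


def fit_sparse_overlap_group_lasso(X, y, groups, lam=0.1, alpha=0.5, n_iter=500):
    """Proximal gradient descent on the extended (duplicated) design matrix."""
    X_ext, cols, slices = expand_design(X, groups)
    n = X.shape[0]
    # Step size 1/L, where L is the Lipschitz constant of the gradient.
    step = n / (np.linalg.norm(X_ext, 2) ** 2)
    v = np.zeros(X_ext.shape[1])
    for _ in range(n_iter):
        grad = X_ext.T @ (X_ext @ v - y) / n
        v = prox_sparse_group(v - step * grad, slices, step, lam, alpha)
    # Map back to the original space by summing the duplicated coefficients.
    beta = np.zeros(X.shape[1])
    np.add.at(beta, cols, v)
    return beta, v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 6))
    y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * rng.standard_normal(100)
    # Hypothetical pathways; feature 2 belongs to two of them.
    groups = [[0, 1, 2], [2, 3], [4, 5]]
    beta, _ = fit_sparse_overlap_group_lasso(X, y, groups, lam=0.05)
    print(np.round(beta, 3))
```

In this sketch a gene that belongs to several pathways receives one coefficient per pathway in the extended space, so each pathway can be kept or dropped as a unit while the elementwise penalty can still zero out individual genes inside a selected pathway.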