Yang Hai, Wei Qiang, Zhong Xue, Yang Hushan, Li Bingshan
Department of Molecular Physiology & Biophysics, Vanderbilt University, Nashville, TN, USA.
Vanderbilt Genetics Institute, Nashville, TN, USA.
Bioinformatics. 2017 Feb 15;33(4):483-490. doi: 10.1093/bioinformatics/btw662.
A comprehensive catalogue of the genes that drive tumor initiation and progression is key to advancing cancer diagnostics, therapeutics and treatment. Given the complexity of cancer, this catalogue is still far from complete. Increasing evidence shows that driver genes exhibit consistent aberration patterns across multiple omics data types in tumors. In this study, we aim to leverage the complementary information encoded in each omics data type to identify novel driver genes through an integrative framework. Specifically, we integrated mutations, gene expression, DNA copy numbers, DNA methylation and protein abundance, all available in The Cancer Genome Atlas (TCGA), and developed iDriver, a non-parametric Bayesian framework based on multivariate statistical modeling that identifies driver genes in an unsupervised fashion. iDriver captures the inherent clusters of gene aberrations and constructs a background distribution, which is used to assess and calibrate the confidence of driver genes identified from multi-dimensional genomic data.
We applied the method to 4 cancer types in TCGA and identified candidate driver genes that are highly enriched for known drivers (e.g. P < 3.40 × 10^-36 for breast cancer). We are particularly interested in novel genes and observed multiple lines of supporting evidence for them. Using systematic evaluation from multiple independent aspects, we identified 45 candidate driver genes across these 4 cancer types that were not previously known. These findings suggest that integrating additional genomic data with multivariate statistics can help identify cancer drivers and guide the next stage of cancer genomics research.
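The abstract does not state which test produced the enrichment P-value. As a minimal sketch, assuming a one-sided hypergeometric test (a common choice for asking whether a candidate list contains more known drivers than expected by chance), the calculation could look like the following. The function name and all gene counts are hypothetical and are not taken from the paper.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, known_drivers, candidates, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of known drivers found among `candidates` genes drawn
    without replacement from `total_genes`, of which `known_drivers` are
    annotated as known drivers. (Illustrative helper, not part of iDriver.)
    """
    max_overlap = min(known_drivers, candidates)
    denominator = comb(total_genes, candidates)
    numerator = sum(
        comb(known_drivers, k) * comb(total_genes - known_drivers, candidates - k)
        for k in range(overlap, max_overlap + 1)
    )
    # Exact integer arithmetic until the final division keeps small tails accurate.
    return numerator / denominator


# Hypothetical counts: 20,000 genes tested, 500 known drivers,
# 100 candidates reported, 30 of which are known drivers.
p_value = hypergeom_enrichment_p(20000, 500, 100, 30)
print(f"enrichment P = {p_value:.3e}")
```

With these made-up counts only about 2.5 known drivers would be expected among the candidates by chance, so an overlap of 30 yields a very small P-value.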
The C++ source code is freely available at https://medschool.vanderbilt.edu/cgg/.
Contact: hai.yang@vanderbilt.edu or bingshan.li@Vanderbilt.Edu.
Supplementary data are available at Bioinformatics online.