Colaprico Antonio, Silva Tiago C, Olsen Catharina, Garofano Luciano, Cava Claudia, Garolini Davide, Sabedot Thais S, Malta Tathiane M, Pagnotta Stefano M, Castiglioni Isabella, Ceccarelli Michele, Bontempi Gianluca, Noushmehr Houtan
Interuniversity Institute of Bioinformatics in Brussels (IB), Brussels, Belgium; Machine Learning Group (MLG), Département d'Informatique, Université libre de Bruxelles (ULB), Brussels, Belgium.
Department of Genetics, Ribeirão Preto Medical School, University of São Paulo, Ribeirão Preto, São Paulo, Brazil; Center for Integrative Systems Biology - CISBi, NAP/USP, Ribeirão Preto, São Paulo, Brazil.
Nucleic Acids Res. 2016 May 5;44(8):e71. doi: 10.1093/nar/gkv1507. Epub 2015 Dec 23.
The Cancer Genome Atlas (TCGA) research network has made public a large collection of clinical and molecular phenotypes of more than 10 000 tumor patients across 33 different tumor types. Using this cohort, TCGA has published over 20 marker papers detailing the genomic and epigenomic alterations associated with these tumor types. Although TCGA's research network has made many important discoveries, opportunities remain to apply novel methods that could elucidate new biological pathways and diagnostic markers. However, mining the TCGA data presents several bioinformatics challenges, such as data retrieval and integration with clinical data and other molecular data types (e.g. RNA and DNA methylation). We developed an R/Bioconductor package called TCGAbiolinks to address these challenges through a guided workflow that lets users query, download and perform integrative analyses of TCGA data. We combined methods from computer science and statistics into the pipeline and incorporated methodologies developed in previous TCGA marker studies and in our own group. Using four different TCGA tumor types (Kidney, Brain, Breast and Colon), we provide case studies that illustrate reproducibility, integrative analysis and the use of different Bioconductor packages to advance and accelerate novel discoveries.