Center for Translational Data Science, University of Chicago, Chicago, IL, USA.
AbbVie Inc., Redwood City, CA, USA.
Nat Commun. 2021 Feb 22;12(1):1226. doi: 10.1038/s41467-021-21254-9.
The goal of the National Cancer Institute's (NCI's) Genomic Data Commons (GDC) is to provide the cancer research community with a data repository of uniformly processed genomic and associated clinical data that enables data sharing and collaborative analysis in support of precision medicine. The initial GDC dataset includes genomic, epigenomic, proteomic, clinical and other data from the NCI TCGA and TARGET programs. Data production for the GDC started in June 2015 using an OpenStack-based private cloud. By June 2016, the GDC had analyzed more than 50,000 raw sequencing data inputs, as well as multiple other data types. Using the latest human genome reference build, GRCh38, the GDC generated a variety of data types, from aligned reads to somatic mutations, gene expression, miRNA expression, DNA methylation status, and copy number variation. In this paper, we describe the pipelines and workflows used to process and harmonize the data in the GDC. The generated data, as well as the original input files from TCGA and TARGET, are available for download and exploratory analysis at the GDC Data Portal and Legacy Archive (https://gdc.cancer.gov/).