Graduate School of Public Health and Health Policy, City University of New York, New York, NY.
Institute for Implementation Science and Population Health, City University of New York, New York, NY.
JCO Clin Cancer Inform. 2020 Oct;4:958-971. doi: 10.1200/CCI.19.00119.
Investigations of the molecular basis for the development, progression, and treatment of cancer increasingly use complementary genomic assays to gather multiomic data, but management and analysis of such data remain complex. The cBioPortal for cancer genomics currently provides multiomic data from > 260 public studies, including The Cancer Genome Atlas (TCGA) data sets, but integration of different data types remains challenging and error prone for computational methods and tools using these resources. Recent advances in data infrastructure within the Bioconductor project enable a novel and powerful approach to creating fully integrated representations of these multiomic, pan-cancer databases.
We provide a set of R/Bioconductor packages for working with TCGA legacy data and cBioPortal data, with special consideration given to loading time; efficient representations in and out of memory; analysis platform; and an integrative framework, such as MultiAssayExperiment. Large methylation data sets are distributed in an out-of-memory representation, which keeps loading times responsive and makes analysis feasible on machines with limited memory.
We developed the curatedTCGAData and cBioPortalData R/Bioconductor packages to provide integrated multiomic data sets from the TCGA legacy database and the cBioPortal web application programming interface using the MultiAssayExperiment data structure. This suite of tools provides coordination of diverse experimental assays with clinicopathological data with minimal data management burden, as demonstrated through several greatly simplified multiomic and pan-cancer analyses.
These integrated representations enable analysts and tool developers to apply general statistical and plotting methods to extensive multiomic data through user-friendly commands and documented examples.
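To make the idea of coordinating several assays with clinical data more concrete, the following is a minimal conceptual sketch in plain Python. It is not the API of MultiAssayExperiment, curatedTCGAData, or cBioPortalData, which are R/Bioconductor packages. It only loosely mirrors the notion of a sample map that links each patient to the sample columns of each assay. All names (`SampleLink`, `complete_cases`, `subset_by_patients`) and all data values are hypothetical and invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleLink:
    """One row of a toy sample map (hypothetical structure)."""
    assay: str    # name of the experiment, e.g. "rnaseq"
    patient: str  # key into the clinical table
    column: str   # sample identifier used inside that assay


# Toy clinical table keyed by patient identifier (invented values).
clinical = {
    "P1": {"stage": "II"},
    "P2": {"stage": "III"},
    "P3": {"stage": "I"},
}

# Toy assays: each maps a sample column to its measurements (invented values).
assays = {
    "rnaseq": {"P1-01A": {"TP53": 5.1}, "P2-01A": {"TP53": 7.3}},
    "mutation": {"P2-01B": {"TP53": 1}, "P3-01B": {"TP53": 0}},
}

# The sample map ties every assay column back to one patient.
sample_map = [
    SampleLink("rnaseq", "P1", "P1-01A"),
    SampleLink("rnaseq", "P2", "P2-01A"),
    SampleLink("mutation", "P2", "P2-01B"),
    SampleLink("mutation", "P3", "P3-01B"),
]


def complete_cases(sample_map, assay_names):
    """Return the patients that have at least one sample in every listed assay."""
    per_assay = {name: set() for name in assay_names}
    for link in sample_map:
        if link.assay in per_assay:
            per_assay[link.assay].add(link.patient)
    sets = list(per_assay.values())
    return set.intersection(*sets) if sets else set()


def subset_by_patients(clinical, assays, sample_map, patients):
    """Restrict clinical rows, assay columns, and the sample map to the given patients."""
    kept_links = [link for link in sample_map if link.patient in patients]
    kept_assays = {
        name: {
            link.column: columns[link.column]
            for link in kept_links
            if link.assay == name
        }
        for name, columns in assays.items()
    }
    kept_clinical = {p: row for p, row in clinical.items() if p in patients}
    return kept_clinical, kept_assays, kept_links


if __name__ == "__main__":
    shared = complete_cases(sample_map, ["rnaseq", "mutation"])
    clin, sub, links = subset_by_patients(clinical, assays, sample_map, shared)
    print(sorted(shared))
    print(clin)
    print(sub)
```

Keeping the patient-to-sample links in one explicit table means a single subsetting step keeps clinical rows and all assays consistent, which is the kind of bookkeeping the integrated representation is meant to take off the analyst's hands.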