Fanfani Viola, Shutta Katherine H, Mandros Panagiotis, Fischer Jonas, Saha Enakshi, Micheletti Soel, Chen Chen, Guebila Marouen Ben, Lopes-Ramos Camila M, Quackenbush John
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, MA, USA.
bioRxiv. 2024 Nov 7:2024.11.05.622163. doi: 10.1101/2024.11.05.622163.
Technological advances in sequencing and computation have allowed deep exploration of the molecular basis of diseases. Biological networks have proven to be a useful framework for interrogating omics data and modeling regulatory gene and protein interactions. Large collaborative projects, such as The Cancer Genome Atlas (TCGA), have provided a rich resource for building and validating new computational methods, resulting in a plethora of open-source software for downloading, preprocessing, and analyzing those data. However, an end-to-end analysis of regulatory networks requires a coherent and reusable workflow that integrates all relevant packages into a robust pipeline.
We developed tcga-data-nf, a Nextflow workflow that allows users to reproducibly infer regulatory networks from the thousands of samples in TCGA using a single command. The workflow can be divided into three main steps: multi-omics data, such as RNA-seq and methylation, are downloaded, preprocessed, and finally used to infer regulatory network models with the netZoo software tools. The workflow is powered by the NetworkDataCompanion R package, a standalone collection of functions for managing, mapping, and filtering TCGA data. Here we show how the pipeline can be used to study differences between colon cancer subtypes that could be explained by epigenetic mechanisms. Lastly, we provide readily accessible pre-generated networks for the 10 most common cancer types.
tcga-data-nf is a complete yet flexible and extensible framework that enables the reproducible inference and analysis of cancer regulatory networks, bridging a gap in the current universe of software tools.