Guzman Carlos, D'Orso Iván
Department of Microbiology, The University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
Present address: Bioinformatics and Systems Biology Graduate Program, University of California, La Jolla, San Diego, CA, 92093, USA.
BMC Bioinformatics. 2017 Aug 8;18(1):363. doi: 10.1186/s12859-017-1770-1.
Next-generation sequencing (NGS) approaches are commonly used to identify key regulatory networks that drive transcriptional programs. Although these technologies are frequently used in biological studies, NGS data analysis remains a challenging, time-consuming, and often irreproducible process. Therefore, there is a need for a comprehensive and flexible workflow platform that can accelerate data processing and analysis so more time can be spent on functional studies.
We have developed an integrative, stand-alone workflow platform, named CIPHER, for the systematic analysis of several commonly used NGS datasets, including ChIP-seq, RNA-seq, MNase-seq, DNase-seq, GRO-seq, and ATAC-seq data. CIPHER uses various open source software packages, in-house scripts, and Docker containers to analyze and process single-end and paired-end datasets. CIPHER's pipelines conduct extensive quality and contamination control checks, as well as comprehensive downstream analysis. A typical CIPHER workflow includes: (1) raw sequence evaluation, (2) read trimming and adapter removal, (3) read mapping and quality filtering, (4) visualization track generation, and (5) extensive quality control assessment. Furthermore, CIPHER conducts assay-specific downstream analysis: narrow and broad peak calling, peak annotation, and motif identification for ChIP-seq; differential gene expression analysis for RNA-seq; nucleosome positioning for MNase-seq; DNase hypersensitive site mapping, site annotation, and motif identification for DNase-seq; analysis of nascent transcription from global run-on sequencing (GRO-seq) data; and characterization of chromatin accessibility from ATAC-seq datasets. In addition, CIPHER contains an "analysis" mode that completes complex bioinformatics tasks such as enhancer discovery and provides functions to integrate various datasets.
Using public and simulated data, we demonstrate that CIPHER is an efficient and comprehensive workflow platform that can analyze several NGS datasets commonly used in genome biology studies. Additionally, CIPHER's integrative "analysis" mode allows researchers to extract important biological information from the combined analysis of multiple datasets.
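The workflow structure described in the abstract (five shared steps followed by assay-specific downstream analysis) can be illustrated with a minimal Python sketch. This is not CIPHER's implementation or interface: the names `CORE_STEPS`, `DOWNSTREAM_STEPS`, and `plan_workflow` are hypothetical and exist only for this example, and the step labels are taken directly from the abstract's wording.

```python
from typing import Dict, List

# Steps shared by every workflow, in the order the abstract lists them.
# (Hypothetical names; this only models the plan, it runs no tools.)
CORE_STEPS: List[str] = [
    "raw sequence evaluation",
    "read trimming and adapter removal",
    "read mapping and quality filtering",
    "visualization track generation",
    "quality control assessment",
]

# Assay-specific downstream analyses named in the abstract.
DOWNSTREAM_STEPS: Dict[str, List[str]] = {
    "chip-seq": [
        "narrow and broad peak calling",
        "peak annotation",
        "motif identification",
    ],
    "rna-seq": ["differential gene expression analysis"],
    "mnase-seq": ["nucleosome positioning"],
    "dnase-seq": [
        "DNase hypersensitive site mapping",
        "site annotation",
        "motif identification",
    ],
    "gro-seq": ["nascent transcription analysis"],
    "atac-seq": ["chromatin accessibility characterization"],
}


def plan_workflow(assay: str) -> List[str]:
    """Return the ordered list of steps for one assay type."""
    key = assay.strip().lower()
    if key not in DOWNSTREAM_STEPS:
        supported = ", ".join(sorted(DOWNSTREAM_STEPS))
        raise ValueError(f"unsupported assay {assay!r}; expected one of: {supported}")
    # Build a new list so callers cannot mutate the shared constants.
    return CORE_STEPS + DOWNSTREAM_STEPS[key]


if __name__ == "__main__":
    for number, step in enumerate(plan_workflow("ChIP-seq"), start=1):
        print(f"{number}. {step}")
```

Keeping the shared steps separate from the per-assay table mirrors the abstract's distinction between the "typical workflow" and the downstream analyses, so supporting another assay in this sketch would mean adding one dictionary entry.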