Smith Jason P, Corces M Ryan, Xu Jin, Reuter Vincent P, Chang Howard Y, Sheffield Nathan C
Center for Public Health Genomics, University of Virginia, VA 22908, USA.
Center for Personal Dynamic Regulomes, Stanford University, Stanford, CA 94304, USA.
NAR Genom Bioinform. 2021 Nov 23;3(4):lqab101. doi: 10.1093/nargab/lqab101. eCollection 2021 Dec.
As chromatin accessibility data from ATAC-seq experiments continue to accumulate, there is a continuing need for standardized analysis pipelines. Here, we present PEPATAC, an ATAC-seq pipeline that is easily applied to ATAC-seq projects of any size, from one-off experiments to large-scale sequencing projects. PEPATAC leverages unique features of ATAC-seq data to optimize for speed and accuracy, and it provides several unique analytical approaches. Output includes convenient quality control plots, summary statistics, and a variety of generally useful data formats to set the groundwork for subsequent project-specific data analysis. Downstream analysis is simplified by a standard definition format, modularity of components, and metadata APIs in R and Python. PEPATAC is restartable and fault-tolerant, and it can be run on local hardware, with any cluster resource manager, or in the provided Linux containers. We also demonstrate the advantage of aligning to the mitochondrial genome serially, which improves the accuracy of alignment statistics and quality control metrics. PEPATAC is a robust and portable first step for any ATAC-seq project. BSD2-licensed code and documentation are available at https://pepatac.databio.org.
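The idea of serial alignment can be illustrated with a minimal toy sketch. This is not PEPATAC code: the function `serial_assign`, the read IDs, and the hit sets are hypothetical stand-ins for real aligner output, and the sketch assumes that reads claimed by an earlier reference (here the mitochondrial genome) are removed before the next reference is considered. Under that assumption, a read that matches both references is counted once, and the nuclear alignment rate is computed over the reads that remain after the mitochondrial step.

```python
from collections import Counter


def serial_assign(reads, references):
    """Assign each read to the first reference, in order, that claims it.

    `reads` is an iterable of read IDs. `references` is an ordered list of
    (name, set_of_matching_read_ids) pairs. Both are toy stand-ins for the
    output of a real aligner (hypothetical, for illustration only).
    """
    counts = Counter()
    remaining = list(reads)
    for name, hits in references:
        passed_on = []
        for read in remaining:
            if read in hits:
                counts[name] += 1       # read is consumed by this reference
            else:
                passed_on.append(read)  # read moves on to the next reference
        remaining = passed_on
    counts["unaligned"] = len(remaining)
    return counts


# Ten toy reads; r3 matches both references (hypothetical shared sequence).
reads = [f"r{i}" for i in range(10)]
mito_hits = {"r0", "r1", "r2", "r3"}
nuclear_hits = {"r3", "r4", "r5", "r6", "r7", "r8"}

counts = serial_assign(reads, [("chrM", mito_hits), ("nuclear", nuclear_hits)])

# Nuclear alignment rate among the reads left after the mitochondrial step.
nuclear_rate = counts["nuclear"] / (len(reads) - counts["chrM"])

print(dict(counts))
print(round(nuclear_rate, 3))
```

In this toy, the order of the `references` list determines which reference receives an ambiguous read, which is why the mitochondrial genome is listed first.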