Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093, USA.
The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA.
Bioinformatics. 2020 May 1;36(10):3234-3235. doi: 10.1093/bioinformatics/btaa061.
Modern genomic research is driven by next-generation sequencing experiments such as ChIP-seq and ChIA-PET, which generate coverage files for transcription factor binding, and DHS and ATAC-seq, which yield coverage files for chromatin accessibility. Such files are stored in either the bedGraph text format or the bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed.
We developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph or a bigWig file. When tested on 12 ChIP-seq, ATAC-seq, RNA-seq and ChIA-PET datasets, pyBedGraph is on average 260 times faster than the existing program pyBigWig. On average, pyBedGraph can look up the exact mean signal of 1 million regions in ∼0.26 s and can compute their approximate means in <0.12 s on a conventional laptop.
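To make the core task concrete (the exact mean signal over a query interval), here is a minimal sketch in plain Python. It is not pyBedGraph's API and is not claimed to be its implementation. The function names `build_prefix_sums` and `interval_mean`, the toy records and the chromosome size are all hypothetical, and prefix sums are only one common way to answer many interval-mean queries quickly. Coordinates follow the bedGraph convention of 0-based, half-open intervals.

```python
from itertools import accumulate


def build_prefix_sums(records, chrom_size):
    """Expand bedGraph-style (start, end, value) records for one chromosome
    into per-base values, then return their prefix sums.

    Hypothetical helper for illustration only; not part of pyBedGraph.
    """
    per_base = [0.0] * chrom_size  # bases not covered by any record stay 0
    for start, end, value in records:
        for pos in range(start, end):  # half-open interval [start, end)
            per_base[pos] = value
    # prefix[i] holds the sum of per_base[0:i], so prefix[0] is 0
    return [0.0] + list(accumulate(per_base))


def interval_mean(prefix, start, end):
    """Exact mean signal over [start, end) using two prefix-sum lookups."""
    return (prefix[end] - prefix[start]) / (end - start)


# Toy coverage track on a 10-base chromosome (made-up values).
records = [(0, 4, 1.0), (4, 6, 3.0), (8, 10, 2.0)]
prefix = build_prefix_sums(records, chrom_size=10)

queries = [(0, 4), (2, 6), (0, 10)]
means = [interval_mean(prefix, start, end) for start, end in queries]
print(means)
```

After the one-time preprocessing pass, each query costs two list lookups regardless of interval length, which is why this kind of layout suits workloads with very many query regions. The trade-off is memory proportional to chromosome length.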
pyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license.
Supplementary data are available at Bioinformatics online.