Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
The Henry Samueli School of Engineering, Center for Pervasive Communications and Computing (CPCC), University of California, Irvine, CA 92697, USA.
Bioinformatics. 2018 Mar 15;34(6):911-919. doi: 10.1093/bioinformatics/btx685.
Chromatin immunoprecipitation sequencing (ChIP-seq) experiments are inexpensive and time-efficient, and they produce massive datasets that pose significant storage and maintenance challenges. To address the resulting Big Data problems, we propose a lossless and lossy compression framework specifically designed for ChIP-seq Wig data, termed ChIPWig. ChIPWig supports random access and summary statistics lookups, and is based on the asymptotic theory of optimal point density design for nonuniform quantizers.
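To illustrate what point density design means for a nonuniform quantizer, the following minimal C++ sketch places reconstruction levels so that their density follows the cube root of an empirical histogram, which is the classical high-resolution result for fixed-rate quantization under mean squared error. This is an assumption chosen for illustration and is not taken from the ChIPWig source code; the function names `design_levels` and `quantize`, the histogram bin count, and the choice of distortion measure are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from ChIPWig): choose `levels` reconstruction points
// whose density follows the cube root of the empirical sample density.
std::vector<double> design_levels(const std::vector<double>& samples,
                                  std::size_t levels, std::size_t bins = 256) {
    assert(!samples.empty() && levels >= 1 && bins >= 1);
    const auto mm = std::minmax_element(samples.begin(), samples.end());
    const double lo = *mm.first, hi = *mm.second;
    if (hi == lo) return std::vector<double>(levels, lo);  // degenerate input
    const double width = (hi - lo) / static_cast<double>(bins);

    // Empirical histogram of the samples over [lo, hi].
    std::vector<double> hist(bins, 0.0);
    for (double x : samples) {
        const std::size_t b =
            std::min(bins - 1, static_cast<std::size_t>((x - lo) / width));
        hist[b] += 1.0;
    }

    // Cumulative sum of the cube root of the per-bin probability.
    std::vector<double> cum(bins + 1, 0.0);
    for (std::size_t i = 0; i < bins; ++i)
        cum[i + 1] = cum[i] + std::cbrt(hist[i] / static_cast<double>(samples.size()));

    // Level k sits where the normalized cumulative sum equals (k + 0.5) / levels.
    std::vector<double> out;
    out.reserve(levels);
    for (std::size_t k = 0; k < levels; ++k) {
        const double target = (k + 0.5) / static_cast<double>(levels) * cum.back();
        const auto it = std::upper_bound(cum.begin(), cum.end(), target);
        std::size_t b = (it == cum.begin()) ? 0 : static_cast<std::size_t>(it - cum.begin()) - 1;
        b = std::min(b, bins - 1);
        const double span = cum[b + 1] - cum[b];
        const double frac = span > 0.0 ? (target - cum[b]) / span : 0.5;
        out.push_back(lo + (static_cast<double>(b) + frac) * width);
    }
    return out;
}

// Map a value to the nearest reconstruction level (levels must be sorted ascending).
double quantize(double x, const std::vector<double>& levels) {
    assert(!levels.empty());
    const auto it = std::lower_bound(levels.begin(), levels.end(), x);
    if (it == levels.begin()) return *it;
    if (it == levels.end()) return levels.back();
    const double above = *it, below = *(it - 1);
    return (x - below <= above - x) ? below : above;
}
```

With heavily skewed data, such as coverage tracks where most positions hold small values, this design concentrates levels where samples are frequent while still reserving some for the sparse tail.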
We tested the ChIPWig compressor on 10 ChIP-seq datasets generated by the ENCODE consortium. On average, lossless ChIPWig reduced file sizes to merely 6% of the original and offered a 6-fold improvement in compression rate compared to bigWig. The lossy mode further reduced file sizes 2-fold compared to the lossless mode, with little or no effect on peak calling and motif discovery using specialized NarrowPeaks methods. Compression and decompression speeds are on the order of 0.2 sec/MB on general-purpose computers.
The source code and binaries, implemented in C++, are freely available for download at https://github.com/vidarmehr/ChIPWig-v2.
Supplementary data are available at Bioinformatics online.