Huang Zhi-An, Wen Zhenkun, Deng Qingjin, Chu Ying, Sun Yiwen, Zhu Zexuan
College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.
School of Medicine, Shenzhen University, Shenzhen, 518060, China.
BMC Bioinformatics. 2017 Mar 20;18(1):179. doi: 10.1186/s12859-017-1588-x.
The rapid progress of high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing, leading to revolutionary advances in the genomics industry. The explosive growth in the volume of raw data outpaces the decline in disk cost, and storing huge volumes of sequencing data has become a bottleneck for downstream analyses. Data compression is considered a way to reduce the dependency on storage, so efficient sequencing data compression methods are in high demand.
In this article, we present LW-FQZip 2, a lossless reference-based compression method for FASTQ files. LW-FQZip 2 improves on LW-FQZip 1 by introducing a more efficient coding scheme and parallelism. In particular, it combines a light-weight mapping model, a bitwise prediction by partial matching (PPM) model, arithmetic coding, and multi-threading. LW-FQZip 2 is evaluated on both short-read and long-read data generated by various sequencing platforms. The experimental results show that it obtains promising compression ratios at reasonable time and memory costs.
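To illustrate the general idea behind reference-based read compression, the following is a minimal sketch and not the LW-FQZip 2 implementation. It assumes an exact-match seed of length `k`, substitution-only differences, and no handling of indels or quality scores; the names `encode_read` and `decode_read` are hypothetical. A read that maps to the reference is stored as a position, a length, and a list of mismatches, from which the original read can be recovered exactly.

```python
from typing import List, Optional, Tuple

# (offset within the read, base observed in the read)
Mismatch = Tuple[int, str]
Encoded = Tuple[int, int, List[Mismatch]]


def encode_read(read: str, reference: str, k: int = 8) -> Optional[Encoded]:
    """Encode a read as (position, length, mismatches) against the reference.

    Assumption: the first k bases of the read match the reference exactly
    and the remainder differs only by substitutions. Returns None when no
    such placement exists, in which case a real tool would fall back to
    storing the read verbatim.
    """
    if len(read) < k:
        return None
    seed = read[:k]
    pos = reference.find(seed)
    while pos != -1:
        if pos + len(read) <= len(reference):
            window = reference[pos:pos + len(read)]
            mismatches = [
                (i, read_base)
                for i, (ref_base, read_base) in enumerate(zip(window, read))
                if ref_base != read_base
            ]
            return pos, len(read), mismatches
        pos = reference.find(seed, pos + 1)
    return None


def decode_read(encoded: Encoded, reference: str) -> str:
    """Rebuild the original read from the reference and the stored mismatches."""
    pos, length, mismatches = encoded
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)
```

In a complete compressor, the resulting positions and mismatch records would then be passed to a statistical model and an entropy coder such as arithmetic coding; that stage is omitted here.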
This performance makes LW-FQZip 2 a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data. LW-FQZip 2 is freely available at http://csse.szu.edu.cn/staff/zhuzx/LWFQZip2 and https://github.com/Zhuzxlab/LW-FQZip2.