LW-FQZip 2: a parallelized reference-based compression of FASTQ files.

Author information

Huang Zhi-An, Wen Zhenkun, Deng Qingjin, Chu Ying, Sun Yiwen, Zhu Zexuan

Affiliations

College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China.

School of Medicine, Shenzhen University, Shenzhen, 518060, China.

Publication information

BMC Bioinformatics. 2017 Mar 20;18(1):179. doi: 10.1186/s12859-017-1588-x.

Abstract

BACKGROUND

The rapid progress of high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing, leading to revolutionary advances in the gene industry. The explosively increasing volume of raw data outpaces the decrease in disk cost, and the storage of huge sequencing datasets has become a bottleneck for downstream analyses. Data compression is considered a way to reduce the dependency on storage, so efficient compression methods for sequencing data are in high demand.

RESULTS

In this article, we present LW-FQZip 2, a lossless reference-based compression method for FASTQ files. LW-FQZip 2 improves on LW-FQZip 1 by introducing a more efficient coding scheme and parallelism. In particular, LW-FQZip 2 combines a light-weight mapping model, a bitwise prediction by partial matching (PPM) model, arithmetic coding, and multi-threading parallelism. LW-FQZip 2 is evaluated on both short-read and long-read data generated by various sequencing platforms. The experimental results show that LW-FQZip 2 obtains promising compression ratios at reasonable time and memory costs.

CONCLUSIONS

This performance makes LW-FQZip 2 a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data. LW-FQZip 2 is freely available at http://csse.szu.edu.cn/staff/zhuzx/LWFQZip2 and https://github.com/Zhuzxlab/LW-FQZip2 .
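The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of splitting FASTQ records into separate streams and compressing blocks of records in parallel threads. It is not the LW-FQZip 2 algorithm: no reference mapping is performed, and Python's standard `lzma` codec stands in for the bitwise PPM model and arithmetic coder. All function names, the `block_size` and `workers` parameters, and the assumption that every separator line is a bare `+` are hypothetical choices made for illustration.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def parse_records(fastq_text):
    """Parse FASTQ text into (header, sequence, quality) tuples.

    Assumption: four lines per record and a bare "+" separator line.
    """
    lines = fastq_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or plus != "+":
            raise ValueError("malformed FASTQ record at line %d" % (i + 1))
        records.append((header, seq, qual))
    return records


def compress_block(records):
    """Compress one block as three independent streams.

    lzma is a stand-in codec here, not the PPM/arithmetic coder of the paper.
    """
    streams = zip(*records)  # headers, sequences, qualities
    return tuple(lzma.compress("\n".join(s).encode("ascii")) for s in streams)


def decompress_block(block):
    """Invert compress_block and return the (header, sequence, quality) tuples."""
    streams = [lzma.decompress(b).decode("ascii").split("\n") for b in block]
    return list(zip(*streams))


def compress_fastq(fastq_text, block_size=2, workers=4):
    """Split records into fixed-size blocks and compress the blocks in threads."""
    records = parse_records(fastq_text)
    blocks = [records[i:i + block_size]
              for i in range(0, len(records), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_block, blocks))


def decompress_fastq(blocks):
    """Rebuild the original FASTQ text from the compressed blocks."""
    out = []
    for block in blocks:
        for header, seq, qual in decompress_block(block):
            out.extend([header, seq, "+", qual])
    return "\n".join(out) + "\n"
```

Calling `decompress_fastq(compress_fastq(text))` on a newline-terminated FASTQ string that meets the stated assumptions returns the input unchanged, which is the lossless property the abstract refers to. Keeping identifiers, bases, and quality scores in separate streams lets each stream be modelled on its own, and independent blocks are what make thread-level parallelism straightforward.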

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8a67/5359991/71b7a44df88a/12859_2017_1588_Fig1_HTML.jpg