Department of Electronics and Information Systems, IDLab, Ghent University - IMEC, Ghent, Belgium.
Center for Biotech Data Science, Ghent University Global Campus, Songdo, Incheon 305-701, Republic of Korea.
Bioinformatics. 2018 Feb 1;34(3):425-433. doi: 10.1093/bioinformatics/btx607.
The past decade has seen the introduction of new technologies that significantly lowered the cost of genome sequencing. As a result, the amount of genomic data that must be stored and transmitted is increasing exponentially. To mitigate storage and transmission issues, we introduce a framework for lossless compression of quality scores.
This article proposes AQUa, an adaptive framework for lossless compression of quality scores. To compress quality scores, AQUa uses a configurable set of coding tools, extended with a Context-Adaptive Binary Arithmetic Coding (CABAC) scheme. Benchmarked against generic single-pass compressors, AQUa reduces file sizes by up to 38.49% compared with GNU Gzip and by up to 6.48% compared with 7-Zip at the Ultra setting, while still supporting random access. Compared with SCALCE, a state-of-the-art, purpose-built, single-pass compressor that does not support random access, AQUa reduces file sizes by up to 21.14%. Compared with QVZ, a state-of-the-art, purpose-built, dual-pass compressor that does not support random access, AQUa produces files that are 6.42-33.47% larger, except for one test file, for which the AQUa output is 0.38% smaller, illustrating the strength of our single-pass compression framework. This work has been spurred by the current activity on genomic information representation (MPEG-G) within the ISO/IEC SC29/WG11 technical committee.
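To give a feel for why context-adaptive binary modeling helps on quality scores, the following is a minimal sketch, not AQUa's actual coding tools or its CABAC implementation. It assumes Phred+33 encoded quality strings, a hypothetical fixed 7-bit binarization, and a hypothetical context made of the previous quality score plus the bits already seen in the current symbol. It does not run an arithmetic coder; it only estimates the ideal code length that such a coder would approach, using simple adaptive bit counts. The function name `adaptive_cost_bits` and its parameters are illustrative.

```python
import math
from collections import defaultdict


def adaptive_cost_bits(quals, use_context=True, nbits=7):
    """Estimate the ideal code length (in bits) of a quality string.

    Assumptions (not taken from the AQUa paper): Phred+33 input, fixed
    nbits-wide binarization, and a context formed by the previous score.
    """
    # One adaptive model per context: counts of 0-bits and 1-bits,
    # initialised to 1 each so that no probability is ever zero.
    counts = defaultdict(lambda: [1, 1])
    total_bits = 0.0
    prev = 0  # previous quality value, used as context
    for ch in quals:
        q = ord(ch) - 33  # Phred+33 offset
        if not 0 <= q < (1 << nbits):
            raise ValueError(f"quality character {ch!r} out of range")
        node = 1  # position in the binarization tree (bits seen so far)
        for i in range(nbits - 1, -1, -1):
            bit = (q >> i) & 1
            ctx = (prev if use_context else 0, node)
            c = counts[ctx]
            p = c[bit] / (c[0] + c[1])   # current probability estimate
            total_bits += -math.log2(p)  # ideal arithmetic-coding cost
            c[bit] += 1                  # adapt the model after coding
            node = (node << 1) | bit
        prev = q
    return total_bits


if __name__ == "__main__":
    sample = "I5" * 200  # synthetic, strongly correlated quality string
    print("with context   :", round(adaptive_cost_bits(sample, True)), "bits")
    print("without context:", round(adaptive_cost_bits(sample, False)), "bits")
```

On correlated input such as the synthetic string above, the context-conditioned estimate is much smaller than the context-free one, which is the effect a context-adaptive scheme is designed to exploit. Real quality data and AQUa's configurable tools will behave differently from this toy model.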
The software is available on GitHub: https://github.com/tparidae/AQUa.