Department of Electronic and Electrical Engineering, Hongik University, Seoul, Republic of Korea.
BMC Bioinformatics. 2021 Dec 20;22(1):606. doi: 10.1186/s12859-021-04516-7.
Advances in sequencing technology have drastically reduced sequencing costs. As a result, the amount of sequencing data has grown explosively. Because FASTQ files (the standard sequencing data format) are huge, they need to be compressed efficiently, especially their quality scores. Several quality score compression algorithms have recently been proposed, mainly focusing on lossy compression to boost the compression rate further. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is running time: compressing a 1 GB file can take thousands of seconds. Compression algorithms are also expected to offer additional features, such as random access. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random access functionality.
This paper proposes a Fast and Concurrent Lossless Quality scores Compressor (FCLQC) that supports random access and achieves a lower running time through concurrent programming. Experimental results show that FCLQC is significantly faster than the baseline compressors in both compression and decompression, at the expense of compression ratio. Compared to LCQS (the baseline quality score compression algorithm), FCLQC compresses at least 31x faster in all settings, at the cost of a compression ratio that is up to 13.58% worse (8.26% on average). Compared to general-purpose compressors (such as 7-zip), FCLQC compresses 3x faster while achieving better compression ratios, by at least 2.08% (4.69% on average). Moreover, FCLQC's random access decompression is also faster than that of the other compressors. FCLQC's concurrency is implemented in Rust, and the performance gain increases near-linearly with the number of threads.
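The combination of concurrency and random access described above can be illustrated with a minimal sketch. It is not FCLQC's implementation: the abstract does not describe the actual coder, so `zlib` is used here purely as a stand-in, Python threads stand in for the Rust concurrency, and the function names (`split_blocks`, `compress_concurrent`, `read_block`) and the block size are hypothetical. The sketch assumes only that quality score lines are grouped into independently compressed blocks and that a small index records where each block lives.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_blocks(lines, block_size):
    """Group quality score lines into fixed-size blocks (the last may be shorter)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def compress_block(block):
    """Compress one block on its own so it can later be decoded in isolation.

    zlib is a stand-in here, not the coder used by FCLQC.
    """
    return zlib.compress("\n".join(block).encode("ascii"), 6)


def compress_concurrent(lines, block_size=2, workers=4):
    """Compress all blocks with a thread pool; return (payload, index).

    index[k] is the (byte offset, compressed length) of block k in payload.
    """
    blocks = split_blocks(lines, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so block k stays at position k.
        compressed = list(pool.map(compress_block, blocks))

    index, offset = [], 0
    for chunk in compressed:
        index.append((offset, len(chunk)))
        offset += len(chunk)
    return b"".join(compressed), index


def read_block(payload, index, k):
    """Random access: decode only block k, without touching the other blocks."""
    offset, length = index[k]
    data = zlib.decompress(payload[offset:offset + length])
    return data.decode("ascii").split("\n")


if __name__ == "__main__":
    quality_lines = ["IIIIHHHG", "FFFFGGGH", "IIIIIIII", "##CCFFFF", "HHHHIIII"]
    payload, index = compress_concurrent(quality_lines, block_size=2)
    print(read_block(payload, index, 1))  # ['IIIIIIII', '##CCFFFF']
```

Because each block is compressed independently, the work divides cleanly across threads and a single block can be decoded from its index entry alone. The price is that no block can exploit statistics from its neighbours, which is one plausible reason a block-based design trades some compression ratio for speed.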
Its superior compression and decompression speed makes FCLQC a practical candidate lossless quality score compressor for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is free for non-commercial use.