Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China.
Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning 530004, China.
Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae323.
Quality score data (QSD) account for 70% of the compressed FastQ files produced by short-read and long-read sequencing technologies. Designing effective QSD compressors that balance compression ratio, time cost, and memory consumption is essential in scenarios such as large-scale genomics data sharing and long-term data backup. This study presents a novel parallel lossless QSD-dedicated compression algorithm named PQSDC, which meets these requirements well. PQSDC is based on two core components: a parallel sequences-partition model designed to reduce peak memory consumption and time cost during compression and decompression, and a parallel four-level run-length prediction mapping model designed to improve the compression ratio. In addition, PQSDC is designed to run with high concurrency on multicore CPU clusters.
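To make the two ideas named above more concrete, the following is a minimal Python sketch of block partitioning combined with plain run-length encoding. It is not the PQSDC implementation: the four-level prediction mapping is not specified in this abstract and is not reproduced here, and all function names (`partition`, `rle_encode`, `compress`, and so on) and the `n_blocks` parameter are hypothetical, chosen only for illustration. A thread pool is used to keep the example short; CPU-bound work in Python would normally use a process pool instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def partition(lines, n_blocks):
    """Split quality-score lines into at most n_blocks contiguous blocks."""
    size = max(1, -(-len(lines) // n_blocks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def rle_encode(line):
    """Plain run-length encoding: 'IIIIFF#' -> [('I', 4), ('F', 2), ('#', 1)]."""
    return [(ch, len(list(run))) for ch, run in groupby(line)]


def rle_decode(pairs):
    """Inverse of rle_encode."""
    return "".join(ch * n for ch, n in pairs)


def encode_block(block):
    """Encode every line of one block independently of the other blocks."""
    return [rle_encode(line) for line in block]


def compress(lines, n_blocks=4):
    """Encode the blocks concurrently; each worker only touches its own block."""
    blocks = partition(lines, n_blocks)
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        return list(pool.map(encode_block, blocks))


def decompress(encoded_blocks):
    """Decode the blocks in order and restore the original list of lines."""
    return [rle_decode(pairs) for block in encoded_blocks for pairs in block]
```

Because each block is encoded on its own, a worker only needs its own block in memory, which is the general intuition behind partitioning as a way to lower peak memory. The round trip `decompress(compress(lines))` returns the original lines, which is the property a lossless compressor must preserve.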
We evaluate PQSDC and four state-of-the-art compression algorithms on 27 real-world datasets comprising 61.857 billion QSD characters and 632.908 million QSD sequences. (1) For short reads, compared with the baselines, PQSDC improves the average compression ratio by up to 7.06% and the weighted average compression ratio by up to 8.01%. During compression and decompression, the maximum total time savings of PQSDC are 79.96% and 84.56%, respectively; the maximum average memory savings are 68.34% and 77.63%, respectively. (2) For long reads, PQSDC improves the average and weighted average compression ratios by up to 12.51% and 13.42%, respectively. The maximum total time savings during compression and decompression are 53.51% and 72.53%, respectively; the maximum average memory savings are 19.44% and 17.42%, respectively. (3) Furthermore, PQSDC ranks second in compression robustness among the tested algorithms, indicating that it is less affected by the probability distribution of the QSD collections. Overall, our work provides a promising solution for parallel QSD compression that balances storage cost, time consumption, and memory occupation well.
The proposed PQSDC compressor can be downloaded from https://github.com/fahaihi/PQSDC.