PQSDC: a parallel lossless compressor for quality scores data via sequences partition and run-length prediction mapping.

Author affiliations

Nankai-Baidu Joint Laboratory, Parallel and Distributed Software Technology Laboratory, TMCC, SysNet, DISSec, GTIISC, College of Computer Science, Nankai University, Tianjin 300350, China.

Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning 530004, China.

Publication information

Bioinformatics. 2024 May 2;40(5). doi: 10.1093/bioinformatics/btae323.

DOI: 10.1093/bioinformatics/btae323

PMID: 38759114

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11139522/

Abstract

MOTIVATION

Quality scores data (QSD) account for 70% of compressed FastQ files produced by short-read and long-read sequencing technologies. Designing effective QSD compressors that balance compression ratio, time cost, and memory consumption is essential in scenarios such as large-scale genomics data sharing and long-term data backup. This study presents PQSDC, a novel parallel lossless compression algorithm dedicated to QSD, which meets these requirements well. PQSDC is based on two core components: a parallel sequences-partition model designed to reduce peak memory consumption and time cost during compression and decompression, and a parallel four-level run-length prediction mapping model designed to improve the compression ratio. In addition, PQSDC is designed to be highly concurrent on multicore CPU clusters.
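To make the two ideas named above more concrete, the following is a minimal illustrative sketch, not the PQSDC implementation. It splits a list of quality-score lines into contiguous blocks and applies plain run-length encoding to each block in a worker pool. The function names (`partition`, `rle_encode`, `compress_parallel`) are hypothetical, and the four-level prediction mapping itself is not reproduced here because the abstract does not specify it. Threads are used only to keep the example simple; CPU-bound work in CPython would likely need a process pool to run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def partition(lines, n_parts):
    """Split quality-score lines into at most n_parts contiguous blocks."""
    if not lines:
        return []
    size = max(1, -(-len(lines) // n_parts))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def rle_encode(line):
    """Run-length encode one quality string as (symbol, run_length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(line)]


def rle_decode(pairs):
    """Invert rle_encode, so the round trip is lossless."""
    return "".join(sym * n for sym, n in pairs)


def compress_block(block):
    """Encode every line of one block independently of other blocks."""
    return [rle_encode(line) for line in block]


def compress_parallel(lines, n_parts=2):
    """Encode each block in its own worker; block order is preserved."""
    blocks = partition(lines, n_parts)
    with ThreadPoolExecutor(max_workers=max(1, n_parts)) as pool:
        return list(pool.map(compress_block, blocks))


if __name__ == "__main__":
    qsd = ["IIIIHHH#", "FFFF", "ABAB"]  # made-up quality strings
    encoded = compress_parallel(qsd, n_parts=2)
    restored = [rle_decode(p) for block in encoded for p in block]
    print(restored == qsd)
```

Because each block is processed independently, only the blocks currently being worked on need to be held in memory, which is the general intuition behind partitioning as a way to bound peak memory.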

RESULTS

We evaluate PQSDC and four state-of-the-art compression algorithms on 27 real-world datasets comprising 61.857 billion QSD characters and 632.908 million QSD sequences. (1) For short reads, compared with the baselines, PQSDC improves the average compression ratio by up to 7.06% and the weighted average compression ratio by up to 8.01%. During compression and decompression, the maximum total time savings of PQSDC are 79.96% and 84.56%, respectively; the maximum average memory savings are 68.34% and 77.63%, respectively. (2) For long reads, PQSDC improves the average and weighted average compression ratios by up to 12.51% and 13.42%, respectively. The maximum total time savings during compression and decompression are 53.51% and 72.53%, respectively; the maximum average memory savings are 19.44% and 17.42%, respectively. (3) Furthermore, PQSDC ranks second in compression robustness among the tested algorithms, indicating that it is less affected by the probability distribution of the QSD collections. Overall, our work provides a promising solution for parallel QSD compression that balances storage cost, time consumption, and memory occupation well.

AVAILABILITY AND IMPLEMENTATION

The proposed PQSDC compressor can be downloaded from https://github.com/fahaihi/PQSDC.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c598/11139522/55855e25196e/btae323f1.jpg
