OReO: optimizing read order for practical compression.

Authors

Girard Mathilde, Vandamme Léa, Cazaux Bastien, Limasset Antoine

Affiliation

Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France.

Publication details

Bioinform Adv. 2025 Jun 3;5(1):vbaf128. doi: 10.1093/bioadv/vbaf128. eCollection 2025.

DOI: 10.1093/bioadv/vbaf128

PMID: 40557420

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12185860/

Abstract

MOTIVATION

Recent advances in high-throughput and third-generation sequencing technologies have created significant challenges in storing and managing the rapidly growing volume of read datasets. Although more than 50 specialized compression tools have been developed, employing methods such as reference-based approaches, customized generic compressors, and read reordering, many users still rely on common generic compressors (e.g. gzip, zstd, xz) for convenience, portability, and reliability, despite their low compression ratios. Here, we introduce Optimizing Read Order (OReO), a simple read-reordering framework that achieves high compression performance without requiring specialized software for decompression. By grouping overlapping reads together before applying generic compressors, OReO exploits inherent redundancies in sequencing data and achieves compression ratios on par with state-of-the-art tools. Moreover, because it relies only on standard decompressors, OReO avoids the need for dedicated installations and maintenance, removing a key barrier to practical adoption.
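The effect described above, placing overlapping reads next to each other so that a generic compressor can find the repeats, can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not OReO's implementation: the abstract does not say how OReO detects overlaps, so the sketch substitutes a simple stand-in heuristic (sorting reads by their lexicographically smallest k-mer, a minimizer-style key). The simulated error-free reads, the helper names (`minimizer`, `simulate_reads`, `compressed_size`), and all parameter values are hypothetical choices made for illustration, and Python's `zlib` (DEFLATE, the algorithm behind gzip) stands in for the generic compressor.

```python
import random
import zlib

K = 15  # k-mer length for the grouping key (illustrative choice, not from the paper)


def minimizer(read: str, k: int = K) -> str:
    """Return the lexicographically smallest k-mer of a read.

    Reads that share this k-mer very likely overlap on the genome,
    so sorting by it places overlapping reads next to each other.
    """
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def simulate_reads(genome_len=200_000, read_len=500, n_reads=4_000, seed=0):
    """Sample error-free reads at random positions of a random genome."""
    rng = random.Random(seed)
    genome = "".join(rng.choices("ACGT", k=genome_len))
    starts = [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def compressed_size(reads) -> int:
    """Size in bytes of the reads, one per line, after DEFLATE at level 9."""
    payload = "\n".join(reads).encode("ascii")
    return len(zlib.compress(payload, 9))


reads = simulate_reads()                  # reads in arbitrary (sampling) order
reordered = sorted(reads, key=minimizer)  # same reads, overlapping ones grouped

size_original = compressed_size(reads)
size_reordered = compressed_size(reordered)
print(f"original order : {size_original} bytes")
print(f"reordered      : {size_reordered} bytes")
```

DEFLATE only looks back 32 KiB for repeated substrings, so in the arbitrary order most overlapping reads are too far apart to be matched, while after reordering they fall inside the window. The reordered output is still a plain compressed stream of reads, so any standard decompressor can read it back; only the read order differs from the input.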

RESULTS

We evaluated OReO on both Oxford Nanopore Technologies (ONT) and HiFi genomic and metagenomic datasets of varying sizes and complexities. Our results demonstrate that OReO provides substantial compression gains with comparable resource usage and outperforms dedicated methods in decompression speed. We propose that future compression strategies should focus on reordering as a means to let generic compression tools fully exploit data redundancy, offering an efficient, sustainable, and user-friendly solution to the growing challenges of sequencing data storage.

AVAILABILITY AND IMPLEMENTATION

The OReO code is open source and available at github.com/girunivlille/oreo.
