Girard Mathilde, Vandamme Léa, Cazaux Bastien, Limasset Antoine
Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France.
Bioinform Adv. 2025 Jun 3;5(1):vbaf128. doi: 10.1093/bioadv/vbaf128. eCollection 2025.
Recent advances in high-throughput and third-generation sequencing technologies have created significant challenges in storing and managing the rapidly growing volume of read datasets. Although more than 50 specialized compression tools have been developed, employing methods such as reference-based approaches, customized generic compressors, and read reordering, many users still rely on common generic compressors (e.g. gzip, zstd, xz) for convenience, portability, and reliability, despite their low compression ratios. Here, we introduce Optimizing Read Order (OReO), a simple read-reordering framework that achieves high compression performance without requiring specialized software for decompression. By grouping overlapping reads together before applying generic compressors, OReO exploits inherent redundancies in sequencing data and achieves compression ratios on par with state-of-the-art tools. Moreover, because it relies only on standard decompressors, OReO avoids the need for dedicated installations and maintenance, removing a key barrier to practical adoption.
We evaluated OReO on both Oxford Nanopore Technologies (ONT) and HiFi genomic and metagenomic datasets of varying sizes and complexities. Our results demonstrate that OReO provides substantial compression gains with comparable resource usage and outperforms dedicated methods in decompression speed. We propose that future compression strategies should focus on reordering as a means to let generic compression tools fully exploit data redundancy, offering an efficient, sustainable, and user-friendly solution to the growing challenges of sequencing data storage.
The OReO code is open source and available at github.com/girunivlille/oreo.
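The effect described above, that placing overlapping reads next to each other lets a generic compressor find the redundancy, can be illustrated with a small simulation. The sketch below is not OReO's algorithm. It assumes a synthetic random genome, error-free reads, and uses each read's known start position as a stand-in for real overlap detection; the names `simulate_reads` and `compressed_size` are hypothetical. It uses zlib (the DEFLATE format behind gzip), whose 32 KB window only finds a repeat if the earlier copy is nearby in the stream.

```python
import random
import zlib


def simulate_reads(genome_len=200_000, read_len=1_000, n_reads=2_000, seed=0):
    """Sample error-free reads from a random genome (illustrative assumption).

    Returns a list of (start_position, sequence) tuples.
    """
    rng = random.Random(seed)
    genome = "".join(rng.choices("ACGT", k=genome_len))
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        reads.append((start, genome[start:start + read_len]))
    return reads


def compressed_size(seqs):
    """Size in bytes of the reads, one per line, after zlib compression."""
    payload = "\n".join(seqs).encode("ascii")
    return len(zlib.compress(payload, 9))


reads = simulate_reads()

# Arbitrary order: overlapping reads are usually far apart in the file.
shuffled = reads[:]
random.Random(1).shuffle(shuffled)
shuffled_seqs = [seq for _, seq in shuffled]

# Grouped order: sorting by start position puts overlapping reads side by side.
# A real tool has no start positions and must infer overlaps from the reads.
grouped_seqs = [seq for _, seq in sorted(reads)]

shuffled_size = compressed_size(shuffled_seqs)
grouped_size = compressed_size(grouped_seqs)
print("shuffled order:", shuffled_size, "bytes")
print("grouped order: ", grouped_size, "bytes")

# Decompression needs nothing beyond the standard decompressor; the reads come
# back complete, only in a different order.
blob = zlib.compress("\n".join(grouped_seqs).encode("ascii"), 9)
restored = zlib.decompress(blob).decode("ascii").split("\n")
```

Both files contain exactly the same reads, so any difference in size comes from ordering alone. Real data adds sequencing errors, reverse-complemented reads, and quality values, none of which this toy model covers.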