Betschart Raphael O, Thalén Felix, Blankenberg Stefan, Zoche Martin, Zeller Tanja, Ziegler Andreas
Cardio-CARE, Medizincampus Davos, Herman-Burchard-Str. 12, Davos Wolfgang, 7265, Davos, Switzerland.
Institute of Cardiogenetics, University of Lübeck, Lübeck, Germany.
Sci Rep. 2025 May 2;15(1):15358. doi: 10.1038/s41598-025-00491-8.
Efficient data compression technologies are crucial to reduce the cost of long-term storage and file transfer in whole genome sequencing studies. This study benchmarked four specialized compression tools developed for paired-end fastq.gz files: DRAGEN ORA 4.3.4 (ORA), Genozip 15.0.62, repaq 0.3.0, and SPRING 1.1.1. The benchmark used three subjects from the Genome in a Bottle Consortium that were sequenced 82 times on an Illumina NovaSeq 6000, with an average coverage of 35x. The study additionally compared Genozip with SAMtools 1.20 for the compression of BAM files. All tools provided lossless compression. ORA and Genozip achieved compression ratios of approximately 1:6 when compressing fastq.gz files, whereas repaq and SPRING achieved lower ratios of 1:2 and 1:4, respectively. repaq and SPRING also took longer than ORA and Genozip for both compression and decompression. Genozip achieved approximately 16% higher compression for BAM files than SAMtools. However, SAMtools compresses BAM files into CRAM files, which are compatible with many software packages. ORA, repaq, and SPRING are limited to compressing fastq.gz files, while Genozip supports various file formats. Although Genozip requires an annual license, its source code is freely available, ensuring sustainability. In conclusion, paired-end short-read sequence data can be efficiently compressed using specialized compression software, and commercial tools offer higher compression ratios than freely available software.
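To make the ratio notation concrete, the following minimal sketch shows how a figure such as 1:6 could be computed from file sizes, reading 1:N as "the compressed output is about one N-th the size of the input". The function names and the file sizes are hypothetical illustrations and are not taken from the study.

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Return N such that the ratio reads 1:N (compressed size : original size)."""
    if original_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("file sizes must be positive")
    return original_bytes / compressed_bytes


def format_ratio(n: float) -> str:
    """Format N in the 1:N notation used in the abstract."""
    return f"1:{n:.1f}"


def storage_saved(original_bytes: int, compressed_bytes: int) -> float:
    """Fraction of storage saved relative to the original file."""
    if original_bytes <= 0 or compressed_bytes <= 0:
        raise ValueError("file sizes must be positive")
    return 1 - compressed_bytes / original_bytes


if __name__ == "__main__":
    # Hypothetical sizes: a 60 GB fastq.gz pair reduced to 10 GB.
    original = 60 * 10**9
    compressed = 10 * 10**9
    print(format_ratio(compression_ratio(original, compressed)))  # 1:6.0
    print(f"{storage_saved(original, compressed):.1%}")           # 83.3%
```

Under this reading, a ratio of 1:2 corresponds to saving half of the storage and 1:6 to saving roughly five sixths of it.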