Lossless and reference-free compression of FASTQ/A files using GeneSqueeze.

Authors

Nazari Foad, Patel Sneh, LaRocca Melissa, Sansevich Alina, Czarny Ryan, Schena Giana, Murray Emma K

Affiliations

Rajant Health Incorporated, 200 Chesterfield Parkway, Malvern, PA 19355, USA.

Publication information

Sci Rep. 2025 Jan 2;15(1):322. doi: 10.1038/s41598-024-79258-6.

DOI:10.1038/s41598-024-79258-6

PMID:39747361

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11696233/

Abstract

As sequencing becomes more accessible, there is an acute need for novel compression methods to efficiently store sequencing files. Omics analytics can leverage sequencing technologies to enhance biomedical research and individualize patient care, but sequencing files demand immense storage capacity, particularly when sequencing is used for longitudinal studies. Addressing the storage challenges posed by these technologies is crucial for omics analytics to achieve its full potential. We present a novel lossless, reference-free compression algorithm, GeneSqueeze, that leverages the patterns inherent in the underlying components of FASTQ files to meet this need. GeneSqueeze's benefits include an auto-tuning compression protocol based on each file's distribution, lossless preservation of IUPAC nucleotides and read identifiers, and unrestricted FASTQ/A file attributes (i.e., read length, number of reads, or read identifier format). We compared GeneSqueeze to the general-purpose compressor, gzip, and to a domain-specific compressor, SPRING, to assess performance. Because of its current Python implementation, GeneSqueeze underperformed gzip and SPRING in the time domain. GeneSqueeze and gzip achieved 100% lossless compression across all elements of the FASTQ files (i.e., the read identifier, sequence, quality score and '+' lines), while both SPRING's traditional and lossless modes exhibited data loss of non-ACGTN IUPAC nucleotides and of metadata following the '+' on the separator line. GeneSqueeze showed compression ratios up to three times higher than gzip's, regardless of read length, number of reads, or file size, and had compression ratios comparable to SPRING's across a variety of factors. Overall, GeneSqueeze represents a competitive and specialized compression method for FASTQ/A files containing nucleotide sequences. As such, GeneSqueeze has the potential to significantly reduce the storage and transmission costs associated with large omics datasets without sacrificing data integrity.
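The kind of losslessness check described above can be illustrated with a minimal Python sketch. It is not GeneSqueeze's code or the authors' benchmark. It assumes unwrapped four-line FASTQ records, uses the standard-library `gzip` module as the compressor under test, and takes compression ratio to mean original size divided by compressed size. The sample reads and the helper names (`split_components`, `lossless_report`, `simulate_lossy`) are hypothetical, and `simulate_lossy` is an invented transformation that mimics the two failure cases named in the abstract (non-ACGTN IUPAC bases and metadata after the '+'), not the behaviour of any real tool.

```python
import gzip
import re

# Tiny made-up FASTQ sample. The second read carries non-ACGTN IUPAC codes
# (R, Y) and repeats its identifier after the '+' on the separator line.
FASTQ = (
    "@read1 sample=A\n"
    "ACGTNACGTN\n"
    "+\n"
    "IIIIIHHHHH\n"
    "@read2 sample=A\n"
    "ACGRYACGTN\n"
    "+read2 sample=A\n"
    "IIIIFFFFHH\n"
)


def split_components(text: str) -> dict:
    """Split unwrapped FASTQ text into its four per-record line types."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    return {
        "identifier": lines[0::4],
        "sequence": lines[1::4],
        "separator": lines[2::4],
        "quality": lines[3::4],
    }


def lossless_report(original: str, restored: str) -> dict:
    """Report, per component, whether the restored text matches the original."""
    a, b = split_components(original), split_components(restored)
    return {name: a[name] == b[name] for name in a}


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size over compressed size (assumed definition; higher is better)."""
    return len(original) / len(compressed)


def simulate_lossy(text: str) -> str:
    """Hypothetical lossy step: non-ACGTN bases become N, '+' metadata is dropped."""
    out = []
    for i, line in enumerate(text.splitlines()):
        if i % 4 == 1:  # sequence line
            line = re.sub(r"[^ACGTN]", "N", line)
        elif i % 4 == 2:  # separator line
            line = "+"
        out.append(line)
    return "\n".join(out) + "\n"


raw = FASTQ.encode("ascii")
packed = gzip.compress(raw)
restored = gzip.decompress(packed).decode("ascii")

report = lossless_report(FASTQ, restored)
lossy_report = lossless_report(FASTQ, simulate_lossy(FASTQ))
ratio = compression_ratio(raw, packed)

print("gzip round trip:", report)
print("simulated lossy step:", lossy_report)
print(f"gzip compression ratio on this sample: {ratio:.2f}")
```

On a sample this small the gzip ratio is dominated by header overhead, so the number only demonstrates the calculation. Comparing the four components separately makes it visible which part of a record a compressor failed to restore, rather than only reporting that the files differ.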

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0526/11696233/63ed3d746f44/41598_2024_79258_Fig1_HTML.jpg
