Nazari Foad, Patel Sneh, LaRocca Melissa, Sansevich Alina, Czarny Ryan, Schena Giana, Murray Emma K
Rajant Health Incorporated, 200 Chesterfield Parkway, Malvern, PA 19355, USA.
Sci Rep. 2025 Jan 2;15(1):322. doi: 10.1038/s41598-024-79258-6.
As sequencing becomes more accessible, there is an acute need for novel compression methods that store sequencing files efficiently. Omics analytics can leverage sequencing technologies to enhance biomedical research and individualize patient care, but sequencing files demand immense storage capacity, particularly when sequencing is used in longitudinal studies. Addressing the storage challenges posed by these technologies is crucial for omics analytics to achieve their full potential. We present GeneSqueeze, a novel lossless, reference-free compression algorithm that leverages the patterns inherent in the underlying components of FASTQ files to meet this need. GeneSqueeze's benefits include an auto-tuning compression protocol based on each file's distribution, lossless preservation of IUPAC nucleotides and read identifiers, and no restrictions on FASTQ/A file attributes (i.e., read length, number of reads, or read identifier format). To assess performance, we compared GeneSqueeze to the general-purpose compressor gzip and to the domain-specific compressor SPRING. Owing to its current Python implementation, GeneSqueeze was slower than both gzip and SPRING. GeneSqueeze and gzip compressed all files losslessly across all elements of the FASTQ files (i.e., the read identifier, sequence, quality score, and '+' lines), whereas both SPRING's traditional and lossless modes exhibited loss of non-ACGTN IUPAC nucleotides and of metadata following the '+' on the separator line. GeneSqueeze achieved compression ratios up to three times higher than gzip's, regardless of read length, number of reads, or file size, and had compression ratios comparable to SPRING's across a variety of factors. Overall, GeneSqueeze represents a competitive and specialized compression method for FASTQ/A files containing nucleotide sequences.
As such, GeneSqueeze has the potential to significantly reduce the storage and transmission costs associated with large omics datasets without sacrificing data integrity.
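The two quantities reported above, compression ratio and per-element losslessness, can be illustrated with a minimal Python sketch that uses only the standard-library gzip module. It is not GeneSqueeze's or SPRING's implementation. The sample records, the helper names (`compression_ratio`, `fastq_fields`, `lossless_report`, `simulate_lossy`), and the definition of compression ratio as original size divided by compressed size are all assumptions made for illustration. The `simulate_lossy` transform is hypothetical and only mimics the kind of loss described (non-ACGTN codes and '+' line metadata).

```python
import gzip
import re

# Hypothetical two-record FASTQ sample. The second record contains
# non-ACGTN IUPAC codes (R, Y) and carries metadata after the '+'.
SAMPLE_FASTQ = "".join([
    "@read1 sample=demo\n", "ACGTN" * 3 + "ACGT\n", "+\n", "I" * 19 + "\n",
    "@read2 sample=demo\n", "ACGTRY" * 3 + "A\n", "+read2 sample=demo\n", "F" * 19 + "\n",
]).encode("ascii")


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Original size divided by compressed size (assumed definition; higher is better)."""
    if not compressed:
        raise ValueError("compressed data is empty")
    return len(original) / len(compressed)


def fastq_fields(data: bytes) -> dict:
    """Split FASTQ bytes into its four line types, one list per type."""
    lines = data.decode("ascii").splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    return {
        "identifier": lines[0::4],
        "sequence": lines[1::4],
        "separator": lines[2::4],
        "quality": lines[3::4],
    }


def lossless_report(original: bytes, restored: bytes) -> dict:
    """For each FASTQ element, report whether it survived the round trip unchanged."""
    before, after = fastq_fields(original), fastq_fields(restored)
    return {name: before[name] == after[name] for name in before}


def simulate_lossy(data: bytes) -> bytes:
    """Hypothetical lossy transform: map non-ACGTN codes to N and strip '+' metadata."""
    lines = data.decode("ascii").splitlines()
    for i in range(0, len(lines), 4):
        lines[i + 1] = re.sub(r"[^ACGTN]", "N", lines[i + 1])
        lines[i + 2] = "+"
    return ("\n".join(lines) + "\n").encode("ascii")


if __name__ == "__main__":
    compressed = gzip.compress(SAMPLE_FASTQ)
    restored = gzip.decompress(compressed)
    print("gzip ratio:", round(compression_ratio(SAMPLE_FASTQ, compressed), 2))
    print("gzip round trip:", lossless_report(SAMPLE_FASTQ, restored))
    print("simulated loss:", lossless_report(SAMPLE_FASTQ, simulate_lossy(SAMPLE_FASTQ)))
```

On an input this small the gzip ratio is not representative, since header overhead dominates; the sketch is meant only to show how each FASTQ element can be checked separately rather than comparing whole files.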