Department of Electrical Engineering, Stanford University, Stanford, CA, USA.
Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
Bioinformatics. 2019 Aug 1;35(15):2674-2676. doi: 10.1093/bioinformatics/bty1015.
High-throughput sequencing technologies produce huge amounts of data in the form of short genomic reads, associated quality values and read identifiers. Because these FASTQ datasets are highly structured, general-purpose compressors cannot fully exploit their inherent redundancy. Although much work has gone into designing FASTQ compressors, most lack support for one or more crucial properties, such as variable-length reads, scalability to high-coverage datasets, pairing-preserving compression and lossless compression.
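To make the structure of FASTQ data concrete, the following minimal Python sketch splits four-line FASTQ records into the three streams named above: read identifiers, reads and quality values. It is an illustration only and is not taken from any FASTQ compressor; the names `FastqRecord`, `parse_fastq` and `split_streams` are hypothetical, and the sketch assumes that each record occupies exactly four lines with no line-wrapped sequences.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class FastqRecord(NamedTuple):
    identifier: str  # header text after the leading '@'
    read: str        # nucleotide sequence
    quality: str     # one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumes no wrapped sequences)."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        # Pull the remaining three lines of the record from the same iterator.
        read, plus, quality = next(it, None), next(it, None), next(it, None)
        if read is None or plus is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(read) != len(quality):
            raise ValueError("read and quality lengths differ")
        yield FastqRecord(header[1:], read, quality)


def split_streams(
    records: Iterable[FastqRecord],
) -> Tuple[List[str], List[str], List[str]]:
    """Separate records into identifier, read and quality streams."""
    identifiers: List[str] = []
    reads: List[str] = []
    qualities: List[str] = []
    for record in records:
        identifiers.append(record.identifier)
        reads.append(record.read)
        qualities.append(record.quality)
    return identifiers, reads, qualities
```

Each stream has different statistics (free-form text, a four-letter alphabet, and a small set of quality symbols), so separating them in this way lets each one be modelled on its own terms. The sketch places no restriction on read length, so variable-length reads pass through unchanged.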
In this work, we propose SPRING, a reference-free compressor for FASTQ files. SPRING supports a wide variety of compression modes and features, including lossless compression, pairing-preserving compression, lossy compression of quality values, long read compression and random access. SPRING achieves substantially better compression than existing tools. For example, it compresses 195 GB of 25× whole-genome human FASTQ data from Illumina's NovaSeq sequencer to less than 7 GB, around 1.6× smaller than previous state-of-the-art FASTQ compressors, while using comparable computational resources.
SPRING can be downloaded from https://github.com/shubhamchandak94/SPRING.
Supplementary data are available at Bioinformatics online.