Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA.
Sci Rep. 2023 Feb 6;13(1):2082. doi: 10.1038/s41598-023-29267-8.
The amount of data produced by genome sequencing experiments has been growing rapidly over the past several years, making compression important for efficient storage, transfer and analysis of the data. In recent years, nanopore sequencing technologies have seen increasing adoption since they are portable, real-time and provide long reads. However, there has been limited progress on compression of nanopore sequencing reads obtained in FASTQ files since most existing tools are either general-purpose or specialized for short read data. We present NanoSpring, a reference-free compressor for nanopore sequencing reads, relying on an approximate assembly approach. We evaluate NanoSpring on a variety of datasets including bacterial, metagenomic, plant, animal, and human whole genome data. For recently basecalled high-quality nanopore datasets, NanoSpring, which focuses only on the base sequences in the FASTQ file, uses just 0.35-0.65 bits per base, which is 3-6× lower than general-purpose compressors like gzip. NanoSpring is competitive in compression ratio and compression resource usage with the state-of-the-art tool CoLoRd while being significantly faster at decompression when using multiple threads (>4× faster decompression with 20 threads). NanoSpring is available on GitHub at https://github.com/qm2/NanoSpring.
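The bits-per-base figure quoted above is the compressed size in bits divided by the number of bases. The following is a minimal sketch of that arithmetic, assuming Python's standard `gzip` module as the general-purpose baseline. The helper names (`bits_per_base`, `gzip_bits_per_base`) and the synthetic random read are hypothetical illustrations; they are not part of NanoSpring and do not reproduce the paper's measurements.

```python
import gzip
import random


def bits_per_base(num_bases: int, compressed_bytes: int) -> float:
    """Return compressed size in bits divided by the number of bases."""
    if num_bases <= 0:
        raise ValueError("num_bases must be positive")
    return 8 * compressed_bytes / num_bases


def gzip_bits_per_base(sequence: str) -> float:
    """Compress a base sequence with gzip and report bits per base."""
    compressed = gzip.compress(sequence.encode("ascii"))
    return bits_per_base(len(sequence), len(compressed))


# Hypothetical input: a uniformly random sequence over A, C, G, T.
# Real nanopore reads overlap one another, which is the redundancy an
# assembly-based compressor can exploit; random data has none.
rng = random.Random(0)
synthetic_read = "".join(rng.choice("ACGT") for _ in range(100_000))
gzip_rate = gzip_bits_per_base(synthetic_read)

print(f"gzip on synthetic data: {gzip_rate:.2f} bits per base")
```

A uniformly random four-letter sequence carries about 2 bits per base, so gzip cannot do much better than that on this synthetic input. The rates reported in the abstract apply to real datasets, where overlapping reads make much lower rates achievable.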