Rivara-Espasandín Martín, Balestrazzi Lucía, Dufort Y Álvarez Guillermo, Ochoa Idoia, Seroussi Gadiel, Smircich Pablo, Sotelo-Silveira José, Martín Álvaro
Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay.
Departamento de Genética, Facultad de Medicina, Universidad de la República, 11800 Montevideo, Uruguay.
Bioinform Adv. 2022 Aug 11;2(1):vbac054. doi: 10.1093/bioadv/vbac054. eCollection 2022.
Nanopore sequencing data represent quality scores with high precision, which makes these scores hard to compress; as a result, they account for most of the information stored in losslessly compressed FASTQ files. This motivates an investigation of how quality score information loss affects downstream analysis of nanopore sequencing FASTQ files.
We polished assemblies for a mock microbial community and a human genome, and we called variants on a human genome. We repeated these experiments with several pipelines, coverage levels, and quality score quantizers. In all cases, we found that quantizing the quality scores makes little difference to the results obtained with the original (non-quantized) data, and sometimes even improves on them. This suggests that the precision currently used for nanopore quality scores may be unnecessarily high, and it motivates the use of lossy compression algorithms for this kind of data. Moreover, we show that even a non-specialized compressor, such as gzip, yields large storage space savings after the quality scores are quantized.
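To make the idea of quality score quantization concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not one of the quantizers evaluated in the paper: the bin boundaries (`UPPER_BOUNDS`), the representative values (`REPRESENTATIVES`), and the helper names are all hypothetical. It also assumes Phred+33 encoding and four-line FASTQ records.

```python
import gzip
from bisect import bisect_left

PHRED_OFFSET = 33  # assumes Phred+33 (Sanger) quality encoding

# Hypothetical 4-bin quantizer, chosen only for illustration:
# scores 0-7 -> 5, 8-13 -> 10, 14-19 -> 17, 20 and above -> 25.
UPPER_BOUNDS = [7, 13, 19]
REPRESENTATIVES = [5, 10, 17, 25]


def quantize_quality(qual: str) -> str:
    """Map each quality character to the representative of its bin."""
    out = []
    for ch in qual:
        q = ord(ch) - PHRED_OFFSET
        rep = REPRESENTATIVES[bisect_left(UPPER_BOUNDS, q)]
        out.append(chr(rep + PHRED_OFFSET))
    return "".join(out)


def quantize_fastq(lines):
    """Yield FASTQ lines with every quality line quantized.

    Assumes strict 4-line records (header, sequence, '+', quality).
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield quantize_quality(line) if i % 4 == 3 else line


def gzip_size(text: str) -> int:
    """Size in bytes of the gzip-compressed ASCII text."""
    return len(gzip.compress(text.encode("ascii")))


if __name__ == "__main__":
    record = ["@read1", "ACGTA", "+", "!(/5I"]
    print(list(quantize_fastq(record)))  # quality line becomes "&&2::"
```

Because the quantized quality line uses only four distinct symbols, a general-purpose compressor such as gzip has far less entropy to encode. Comparing `gzip_size` on the original and quantized quality strings gives a rough sense of the savings, although the actual figures depend on the data and on the quantizer used.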
Quantizers are freely available for download at: https://github.com/mrivarauy/QS-Quantizer.