Nanopore quality score resolution can be reduced with little effect on downstream analysis.

Authors

Rivara-Espasandín Martín, Balestrazzi Lucía, Dufort Y Álvarez Guillermo, Ochoa Idoia, Seroussi Gadiel, Smircich Pablo, Sotelo-Silveira José, Martín Álvaro

Affiliations

Instituto de Computación, Facultad de Ingeniería, Universidad de la República, 11300 Montevideo, Uruguay.

Departamento de Genética, Facultad de Medicina, Universidad de la República, 11800 Montevideo, Uruguay.

Publication information

Bioinform Adv. 2022 Aug 11;2(1):vbac054. doi: 10.1093/bioadv/vbac054. eCollection 2022.

Abstract

MOTIVATION

The use of high precision for representing quality scores in nanopore sequencing data makes these scores hard to compress and, thus, responsible for most of the information stored in losslessly compressed FASTQ files. This motivates the investigation of the effect of quality score information loss on downstream analysis from nanopore sequencing FASTQ files.

RESULTS

We polished assemblies for a mock microbial community and a human genome, and we called variants on a human genome. We repeated these experiments using various pipelines, coverage levels and quality score quantizers. In all cases, we found that quantizing the quality scores makes little difference to (and sometimes even improves) the results obtained with the original (non-quantized) data. This suggests that the precision currently used for nanopore quality scores may be unnecessarily high, and motivates the use of lossy compression algorithms for this kind of data. Moreover, we show that even a non-specialized compressor, such as gzip, yields large storage space savings after the quantization of quality scores.
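To make the idea of quality score quantization concrete, here is a minimal Python sketch. It assumes Phred+33 encoded quality strings, as is usual in FASTQ files. The function name `quantize_quality`, the four-level bin table `FOUR_BINS` and its thresholds are hypothetical choices for illustration and are not the quantizers evaluated in the paper. The quality line is synthetic (uniform random scores), so the printed sizes say nothing about real nanopore data; the sketch only shows the mechanism by which a smaller score alphabet helps a general-purpose compressor such as gzip.

```python
import gzip
import random


def quantize_quality(qual: str, bins, offset: int = 33) -> str:
    """Replace each Phred+33 quality character by the representative of its bin.

    `bins` is a list of (exclusive_upper_bound, representative) pairs sorted
    by upper bound. A score falls into the first bin whose bound exceeds it.
    """
    out = []
    for ch in qual:
        q = ord(ch) - offset          # decode character to integer score
        rep = bins[-1][1]             # fallback: representative of the top bin
        for upper, value in bins:
            if q < upper:
                rep = value
                break
        out.append(chr(rep + offset))  # re-encode as Phred+33
    return "".join(out)


# Hypothetical 4-level quantizer: thresholds and representatives are
# illustrative assumptions, not values taken from the paper.
FOUR_BINS = [(7, 5), (14, 10), (21, 18), (float("inf"), 25)]

# Synthetic quality line: 100,000 scores drawn uniformly from 1..40.
rng = random.Random(0)
original = "".join(chr(rng.randint(1, 40) + 33) for _ in range(100_000))
quantized = quantize_quality(original, FOUR_BINS)

# Compare gzip output sizes before and after quantization.
size_original = len(gzip.compress(original.encode("ascii")))
size_quantized = len(gzip.compress(quantized.encode("ascii")))
print(f"gzip bytes, original scores:  {size_original}")
print(f"gzip bytes, quantized scores: {size_quantized}")
```

Quantization keeps the quality string the same length, so the FASTQ record stays valid; only the number of distinct symbols drops, which is what lowers the compressed size.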

AVAILABILITY AND SUPPLEMENTARY INFORMATION

Quantizers are freely available for download at: https://github.com/mrivarauy/QS-Quantizer.
