Denoising of Quality Scores for Boosted Inference and Reduced Storage.

Authors

Ochoa Idoia, Hernaez Mikel, Goldfeder Rachel, Weissman Tsachy, Ashley Euan

Affiliations

Department of Electrical Engineering, Stanford University, Stanford, CA, 94305.

Department of Medicine, Stanford University, Stanford, CA, 94305.

Publication information

Proc Data Compress Conf. 2016 Mar-Apr;2016:251-260. doi: 10.1109/DCC.2016.92. Epub 2016 Dec 19.

Abstract

Advances in sequencing technology and a dramatic drop in sequencing cost are generating massive amounts of sequencing data. Much of the raw data consists of nucleotides and the corresponding quality scores that indicate their reliability. The quality scores are more difficult to compress than the nucleotides and are themselves noisy. Lossless and lossy compression of the quality scores has recently been proposed to alleviate storage costs, but reducing the noise in the quality scores has remained largely unexplored. The raw data is processed to identify genetic variants, which are used in important applications such as medical decision making. Improving the performance of variant calling by reducing the noise in the quality scores is therefore important. We propose a denoising scheme that reduces the noise in the quality scores, and we demonstrate improved inference with the denoised data. Specifically, we show that replacing the quality scores with those generated by the proposed denoiser results in more accurate variant calling in general. Moreover, a consequence of the denoising is that the entropy of the produced quality scores is smaller, so significant compression can be achieved relative to lossless compression of the original quality scores. We expect our results to provide a baseline for future research on denoising of quality scores. The code used in this work and a Supplement with all the results are available at http://web.stanford.edu/~iochoa/DCCdenoiser_CodeAndSupplement.zip.
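The link between fewer distinct quality values and lower entropy can be illustrated with a small sketch. It is not the denoiser proposed in the paper: the `coarsen` function below is a hypothetical stand-in that simply bins scores, and the function names, the bin width, and the sample scores are all assumptions made for illustration. The sketch assumes the common Phred+33 FASTQ encoding, in which a score Q corresponds to an error probability of 10^(-Q/10).

```python
import math
from collections import Counter


def decode_quality(qual, offset=33):
    """Convert a FASTQ quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q):
    """Error probability implied by a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def empirical_entropy(symbols):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def coarsen(scores, bin_width=10):
    """Hypothetical stand-in for a denoiser: map each score to its bin midpoint.

    This is plain quantization, used only to show how fewer distinct
    values lower the empirical entropy. It is NOT the paper's method.
    """
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


if __name__ == "__main__":
    raw = [40, 38, 37, 12, 11, 40, 39, 2]  # made-up scores for illustration
    smoothed = coarsen(raw)
    print("raw entropy (bits/symbol):     ", empirical_entropy(raw))
    print("smoothed entropy (bits/symbol):", empirical_entropy(smoothed))
    print("error prob. at Q=40:", phred_to_error_prob(40))
```

A lossless compressor cannot do better on average than the entropy of its source model, so a sequence with lower empirical entropy generally compresses to fewer bits. Whether the altered scores also improve variant calling is a separate question that this sketch does not address.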
