Ochoa Idoia, Hernaez Mikel, Goldfeder Rachel, Weissman Tsachy, Ashley Euan
Department of Electrical Engineering, Stanford University, Stanford, CA, 94305.
Department of Medicine, Stanford University, Stanford, CA, 94305.
Proc Data Compress Conf. 2016 Mar-Apr;2016:251-260. doi: 10.1109/DCC.2016.92. Epub 2016 Dec 19.
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in sequencing cost. Much of the raw data consists of nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. Lossless and lossy compression of the quality scores has recently been proposed to alleviate storage costs, but reducing the noise in the quality scores has remained largely unexplored. The raw data are processed in order to identify genetic variants, which are used in important applications such as medical decision making. Thus, improving the performance of variant calling by reducing the noise contained in the quality scores is important. We propose a denoising scheme that reduces the noise of the quality scores, and we demonstrate improved inference with the denoised data. Specifically, we show that replacing the quality scores with those generated by the proposed denoiser results in more accurate variant calling in general. Moreover, a consequence of the denoising is that the entropy of the produced quality scores is smaller, and thus significant compression can be achieved with respect to lossless compression of the original quality scores. We expect our results to provide a baseline for future research in denoising of quality scores. The code used in this work, as well as a Supplement with all the results, is available at http://web.stanford.edu/~iochoa/DCCdenoiser_CodeAndSupplement.zip.
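The link between lower entropy and better compressibility can be illustrated with a small sketch. This is not the denoiser proposed in the paper: the function `toy_smooth` is a hypothetical stand-in that simply bins Phred scores, the quality string is made up, and Phred+33 encoding is assumed. It only shows that mapping quality scores onto fewer distinct values cannot increase their zeroth-order empirical entropy, which is a lower bound on the bits per symbol needed by a memoryless lossless coder.

```python
import math
from collections import Counter


def empirical_entropy(quality: str) -> float:
    """Zeroth-order empirical entropy of a quality string, in bits per symbol."""
    n = len(quality)
    if n == 0:
        return 0.0
    counts = Counter(quality)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def toy_smooth(quality: str, width: int = 8, offset: int = 33) -> str:
    """Hypothetical stand-in for a denoiser: bin each Phred score.

    Each score is replaced by the midpoint of its bin of size `width`.
    The paper's actual scheme is not reproduced here.
    """
    smoothed = []
    for ch in quality:
        score = ord(ch) - offset                      # decode Phred+33 (assumed)
        binned = (score // width) * width + width // 2
        smoothed.append(chr(binned + offset))         # re-encode
    return "".join(smoothed)


# Made-up quality string, for illustration only.
original = "IIIHHGIIFHIIGHII5:<IIHGF"
smoothed = toy_smooth(original)

print("original bits/symbol:", empirical_entropy(original))
print("smoothed bits/symbol:", empirical_entropy(smoothed))
```

Because the binning is a deterministic function of each symbol, the entropy of the output is never larger than that of the input. Whether a given transformation also improves variant calling is a separate question that the paper evaluates empirically.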