NICTA Victoria Research Laboratory, Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.
Bioinformatics. 2014 Aug 1;30(15):2130-6. doi: 10.1093/bioinformatics/btu183. Epub 2014 Apr 10.
Next-generation sequencing technologies are revolutionizing medicine. Data from sequencing technologies are typically represented as a string of bases, an associated sequence of per-base quality scores and other metadata, and in aggregate can require a large amount of space. The quality scores indicate how confident the sequencer is that each base has been called correctly, and they are the largest component of datasets in which they are retained. Previous research has examined how to store sequences of bases effectively; here we add to that knowledge by examining methods for compressing quality scores. The quality values originate in a continuous domain, so once a fidelity criterion is introduced there is flexibility in how these values are represented, which allows lossy compression of the quality score data.
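To make the idea of a fidelity criterion concrete, the following minimal sketch quantizes quality scores into fixed-width bins, so that every reconstructed score lies within a bounded distance of the original. It is a generic illustration and not one of the methods proposed in the paper. The Phred+33 ASCII encoding, the bin width and all function names are assumptions chosen for the example.

```python
def decode_quality_string(qual: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string into integer scores (assumes Phred+33)."""
    return [ord(c) - offset for c in qual]


def phred_to_error_prob(q: int) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def uniform_bin(scores: list[int], bin_width: int) -> list[int]:
    """Replace each score by the midpoint of its fixed-width bin (lossy)."""
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


def max_abs_error(original: list[int], lossy: list[int]) -> int:
    """Fidelity measure: the largest per-score deviation after quantization."""
    return max(abs(a - b) for a, b in zip(original, lossy))


if __name__ == "__main__":
    scores = decode_quality_string("I5!")   # hypothetical quality string
    lossy = uniform_bin(scores, bin_width=4)
    print(scores, lossy, max_abs_error(scores, lossy))
```

With a bin width of 4, fewer distinct symbols remain to be encoded, and no score moves by more than 2. Wider bins trade more distortion for a smaller alphabet.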
We present existing compression options for quality score data, and then introduce two new lossy techniques. Experiments measuring the trade-off between compression ratio and information loss are reported, including quantifying the effect of lossy representations on a downstream application that carries out single-nucleotide polymorphism and insertion/deletion detection. The new methods are demonstrably superior to other techniques when assessed against the spectrum of possible trade-offs between storage required and fidelity of representation.
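A trade-off between storage and fidelity of this kind can be sketched by sweeping a quantization parameter and recording a size proxy alongside a distortion measure. The example below uses zeroth-order empirical entropy (bits per score) as a stand-in for compressed size and mean squared error as the distortion. Both measures, the floor-to-bin quantizer and the function names are assumptions for illustration; the abstract does not state which measures the experiments use.

```python
import math
from collections import Counter


def empirical_entropy(symbols: list[int]) -> float:
    """Zeroth-order entropy in bits per symbol, a rough proxy for coded size."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mean_squared_error(original: list[int], lossy: list[int]) -> float:
    """Average squared deviation between original and lossy scores."""
    return sum((a - b) ** 2 for a, b in zip(original, lossy)) / len(original)


def tradeoff_curve(scores: list[int], bin_widths: list[int]):
    """Return (bin_width, bits_per_score, mse) for each candidate bin width."""
    points = []
    for w in bin_widths:
        # Hypothetical quantizer: map each score to the start of its bin.
        lossy = [(q // w) * w for q in scores]
        points.append((w, empirical_entropy(lossy), mean_squared_error(scores, lossy)))
    return points


if __name__ == "__main__":
    for width, bits, mse in tradeoff_curve([0, 1, 2, 3], [1, 2, 4]):
        print(width, bits, mse)
```

Plotting bits per score against distortion for several methods gives one curve per method, and a method whose curve lies below another's needs less space at the same fidelity.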
An implementation of the methods described here is available at https://github.com/rcanovas/libCSAM.
Contact: rcanovas@student.unimelb.edu.au
Supplementary data are available at Bioinformatics online.