Lossy compression of quality scores in genomic data.

Affiliation

NICTA Victoria Research Laboratory, Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia.

Publication information

Bioinformatics. 2014 Aug 1;30(15):2130-6. doi: 10.1093/bioinformatics/btu183. Epub 2014 Apr 10.

DOI:10.1093/bioinformatics/btu183

PMID:24728856

Abstract

MOTIVATION

Next-generation sequencing technologies are revolutionizing medicine. Data from sequencing technologies are typically represented as a string of bases, an associated sequence of per-base quality scores and other metadata, and in aggregate can require a large amount of space. The quality scores show how accurate the bases are with respect to the sequencing process, that is, how confident the sequencer is of having called them correctly, and are the largest component in datasets in which they are retained. Previous research has examined how to store sequences of bases effectively; here we add to that knowledge by examining methods for compressing quality scores. The quality values originate in a continuous domain, and so if a fidelity criterion is introduced, it is possible to introduce flexibility in the way these values are represented, allowing lossy compression over the quality score data.
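To make the idea of a fidelity criterion concrete, the following is a minimal sketch of one simple lossy scheme: uniform binning of quality scores, with the maximum absolute error as the distortion measure. It is an illustration only and is not one of the methods introduced in the paper. The Phred+33 ASCII encoding, the bin width, the choice of error measure and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def decode_phred33(qual_string):
    """Decode a quality string, assuming Phred+33 ASCII encoding."""
    return [ord(c) - 33 for c in qual_string]


def phred_to_error_prob(q):
    """Standard Phred relation: P(error) = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def uniform_bin(scores, width):
    """Lossy step: replace each score by a representative of its bin.

    Bins are [0, width), [width, 2*width), ... and the representative
    is the bin start plus width // 2 (hypothetical choice for this sketch).
    """
    return [(q // width) * width + width // 2 for q in scores]


def entropy_bits(symbols):
    """Empirical zero-order entropy in bits per symbol, a rough proxy
    for how well an entropy coder could compress the sequence."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def max_abs_error(original, lossy):
    """Fidelity measure assumed here: largest per-score deviation."""
    return max(abs(a - b) for a, b in zip(original, lossy))


if __name__ == "__main__":
    # Made-up quality string, for illustration only.
    scores = decode_phred33("IIIIHHGGFFEEDDCCBBAA@@??>>")
    lossy = uniform_bin(scores, width=4)
    print("entropy before:", entropy_bits(scores))
    print("entropy after: ", entropy_bits(lossy))
    print("max abs error: ", max_abs_error(scores, lossy))
```

Widening the bins lowers the entropy of the stored sequence and raises the worst-case error, which is the kind of storage versus fidelity trade-off the abstract refers to.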

RESULTS

We present existing compression options for quality score data, and then introduce two new lossy techniques. Experiments measuring the trade-off between compression ratio and information loss are reported, including quantifying the effect of lossy representations on a downstream application that carries out single nucleotide polymorphism and insert/deletion detection. The new methods are demonstrably superior to other techniques when assessed against the spectrum of possible trade-offs between storage required and fidelity of representation.

AVAILABILITY AND IMPLEMENTATION

An implementation of the methods described here is available at https://github.com/rcanovas/libCSAM.

CONTACT

rcanovas@student.unimelb.edu.au

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
