Center for Integrative Bioinformatics Vienna, Max F Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr Bohr Gasse 9, Vienna A-1030, Austria.
Nucleic Acids Res. 2013 Jan 7;41(1):e27. doi: 10.1093/nar/gks939. Epub 2012 Oct 12.
A major challenge of current high-throughput sequencing experiments lies not only in generating the sequencing data but also in processing, storing and transmitting them. The enormous size of these data motivates the development of compression algorithms that support the various storage policies applied to intermediate and final result files. In this article, we present NGC, a tool for the compression of mapped short read data stored in the widespread SAM format. NGC enables lossless and lossy compression and introduces two novel ideas: first, a way to reduce the number of required code words by exploiting common features of reads mapped to the same genomic positions; second, a highly configurable way to quantize per-base quality values that takes their influence on downstream analyses into account. Evaluated with several real-world data sets, NGC saves 33-66% of disk space using lossless compression and up to 98% using lossy compression. By applying two popular variant and genotype prediction tools to the decompressed data, we show that the lossy compression modes preserve >99% of all called variants while outperforming comparable methods in some configurations.
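To illustrate the general idea of quality-value quantization mentioned above, the following is a minimal sketch of a generic binning scheme, not NGC's actual method. It assumes Phred+33 encoded quality strings as used in the SAM QUAL field; the function name `quantize_qualities` and the default bin boundaries are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right


def quantize_qualities(qual, boundaries=(10, 20, 30), offset=33):
    """Replace each quality value with the lower bound of its bin.

    Assumptions (not taken from the NGC paper): qualities are Phred+33
    characters, and `boundaries` is a sorted tuple of bin lower bounds.
    Values below the first boundary are mapped to 0.
    """
    representatives = (0,) + tuple(boundaries)
    out = []
    for ch in qual:
        q = ord(ch) - offset                 # decode character to Phred score
        idx = bisect_right(boundaries, q)    # index of the bin containing q
        out.append(chr(representatives[idx] + offset))
    return "".join(out)


original = "#.8BI"
quantized = quantize_qualities(original)
# The quantized string never uses more distinct symbols than there are bins,
# which is what makes it easier to compress.
print(quantized, len(set(original)), len(set(quantized)))
```

Coarser boundaries shrink the symbol alphabet further at the cost of more information loss; how much loss is tolerable depends on the downstream analysis.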