Kozanitis Christos, Saunders Chris, Kruglyak Semyon, Bafna Vineet, Varghese George
Department of Computer Science and Engineering, University of California, San Diego, California, USA.
J Comput Biol. 2011 Mar;18(3):401-13. doi: 10.1089/cmb.2010.0253.
With the advent of next-generation sequencing technologies, the cost of sequencing whole genomes is poised to fall below $1000 per human individual within a few years. As more genomes are sequenced, analysis methods are developing rapidly, making it tempting to store sequencing data for long periods so that it can be re-analyzed with the latest techniques. Challenging open research problems, a huge influx of data, and rapidly improving analysis techniques have created the need to store and transfer very large volumes of data. Compression can be achieved at many levels, including the trace level (compressing image data), the sequence level (compressing a genomic sequence), and the fragment level (compressing a set of short, redundant fragment reads, along with quality values on the base calls). We focus on fragment-level compression, which is the pressing need today. Our article makes two contributions, implemented in a tool, SlimGene. First, we introduce a set of domain-specific lossless compression schemes that achieve over 40× compression of fragments, outperforming bzip2 by over 6×. Including quality values, we show a 5× compression using less running time than bzip2. Second, given the discrepancy between the compression factors obtained with and without quality values, we initiate the study of using "lossy" quality values. Specifically, we show that a lossy quality-value quantization results in 14× compression but has minimal impact on downstream applications, such as SNP calling, that use the quality values. Discrepancies between the SNP calls made on the lossy and lossless versions of the data are limited to low-coverage areas where even the SNP calls made on the lossless version are marginal.
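The abstract does not describe how the quality values are quantized, so the following is only a minimal sketch of the general idea, assuming a uniform binning of Phred-style scores. The function names `quantize_quality` and `compressed_size`, the bin width of 8, and the use of Python's standard `bz2` module as a stand-in compressor are all assumptions made for illustration; none of this is taken from SlimGene itself.

```python
import bz2
import random


def quantize_quality(scores, bin_width=8):
    """Lossy step (assumed scheme): replace each score by the lower edge of its bin.

    With bin_width=8, scores 0..7 map to 0, 8..15 map to 8, and so on,
    so fewer distinct symbols remain for the compressor to encode.
    """
    if bin_width < 1:
        raise ValueError("bin_width must be at least 1")
    return [(q // bin_width) * bin_width for q in scores]


def compressed_size(scores):
    """Size in bytes of the scores after bz2 compression (one byte per score)."""
    return len(bz2.compress(bytes(scores)))


def demo(n=10000, seed=0):
    """Compare compressed sizes on synthetic scores drawn uniformly from 0..40."""
    rng = random.Random(seed)
    raw = [rng.randint(0, 40) for _ in range(n)]  # synthetic data, not real reads
    lossy = quantize_quality(raw)
    return compressed_size(raw), compressed_size(lossy)


if __name__ == "__main__":
    raw_size, lossy_size = demo()
    print(f"raw: {raw_size} bytes, quantized: {lossy_size} bytes")
```

Coarser bins shrink the output further but discard more information, which is why the abstract evaluates the effect on SNP calling rather than the compression factor alone. Uniformly random synthetic scores are a poor model of real quality strings, so the sizes printed here say nothing about the 14× figure reported above.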