BMC Bioinformatics. 2014;15 Suppl 9(Suppl 9):S7. doi: 10.1186/1471-2105-15-S9-S7. Epub 2014 Sep 10.
Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.
We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.
Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly.
大规模下一代测序(NGS)实验的数据在存储成本和文件传输所需时间方面都存在挑战。有时可以只存储与特定应用相关的摘要,但通常需要保留所有信息,以便将来重新访问实验结果。因此,需要针对 NGS 读取的高效无损压缩方法。已经表明,NGS 特定的压缩方案可以改善通用压缩方法(例如 Lempel-Ziv 算法、Burrows-Wheeler 变换或算术编码)的结果。当存在参考基因组时,可以通过首先将读取与参考基因组对齐,然后使用对齐位置与读取相对于参考的差异相结合来对每个读取进行编码,从而实现有效的压缩。这些基于参考的方法已被证明比无参考方案压缩得更好,但它们所需的对齐步骤在典型数据集上需要数小时的 CPU 时间,而无参考方法通常可以在几分钟内压缩。
我们提出了一种新方法,该方法通过使用参考基因组实现了高效压缩,但完全避免了对齐的需要,大大减少了压缩所需的时间。与首先将读取与基因组对齐的基于参考的方法相反,我们将所有读取哈希到布隆过滤器中进行编码,并通过查询参考基因组的读取长度子序列来查询相同的布隆过滤器进行解码。通过使用此类过滤器的级联,可以进一步压缩。
我们的方法称为 BARCODE,与基于参考的方法相比,运行速度快一个数量级,而与无参考的方法相比,压缩速度快一个数量级,适用于广泛的测序覆盖范围。在高覆盖率(50-100 倍)下,与经过最佳测试的压缩器相比,BARCODE 节省了 80-90%的运行时间,而仅略微增加了空间。