Fast lossless compression via cascading Bloom filters.

Publication information

BMC Bioinformatics. 2014;15 Suppl 9(Suppl 9):S7. doi: 10.1186/1471-2105-15-S9-S7. Epub 2014 Sep 10.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4168706/

Abstract

BACKGROUND

Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.
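
The reference-based encoding described above (alignment position plus differences from the reference) can be illustrated with a minimal Python sketch. It assumes the alignment position is already known, since finding it is the aligner's job, and it handles substitutions only, not insertions or deletions. The function names `encode_read` and `decode_read` are hypothetical and do not come from any of the tools discussed.

```python
def encode_read(read, reference, pos):
    """Encode a read as (position, length, substitutions) against the reference.

    `pos` is assumed to come from a prior alignment step (not shown here).
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit at this reference position")
    # Keep only the offsets where the read disagrees with the reference.
    diffs = [(i, base)
             for i, (base, ref_base) in enumerate(zip(read, window))
             if base != ref_base]
    return pos, len(read), diffs


def decode_read(reference, pos, length, diffs):
    """Rebuild the read by copying the reference window and applying the diffs."""
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Illustrative data: a read with one substitution relative to the reference.
reference = "ACGTACGGTCAGTT"
read = "TACCGT"
encoded = encode_read(read, reference, pos=3)
assert decode_read(reference, *encoded) == read
```

A read that matches the reference closely costs little more than its position, which is why such schemes compress well. The expensive part is the alignment that produces `pos`.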

RESULTS

We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.
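
The encode and decode steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not BARCODE's implementation: the abstract does not specify how the cascade is built, so the sketch assumes each filter stores the false positives of the previous one, with a small explicit set resolving whatever the last filter gets wrong. The class `BloomFilter`, the functions `encode`, `decode`, `is_read` and `reference_windows`, the SHA-256 hashing, and the filter sizes are all hypothetical choices. Only the set of distinct reads is recovered, and reads that do not occur in the reference are simply listed separately.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter; the size and hash count are illustrative."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive each bit position from SHA-256 salted with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def reference_windows(reference, read_len):
    """All distinct read-length substrings of the reference."""
    return {reference[i:i + read_len]
            for i in range(len(reference) - read_len + 1)}


def encode(reads, reference, read_len, bits_per_level=(64, 32, 16)):
    """Return (cascade, remainder, unmapped) for a collection of reads."""
    windows = reference_windows(reference, read_len)
    unmapped = sorted(set(reads) - windows)  # reads absent from the reference
    to_store = set(reads) & windows          # what the next filter must hold
    to_reject = windows - to_store           # what it should ideally reject
    cascade = []
    for num_bits in bits_per_level:
        bf = BloomFilter(num_bits)
        for item in to_store:
            bf.add(item)
        cascade.append(bf)
        false_positives = {w for w in to_reject if w in bf}
        # The next level stores this level's false positives and must
        # reject the items that this level stored.
        to_store, to_reject = false_positives, to_store
    return cascade, to_store, unmapped


def is_read(window, cascade, remainder):
    for level, bf in enumerate(cascade):
        if window not in bf:
            # First rejection at an odd level means the window was stored
            # at an even level, and even levels hold reads.
            return level % 2 == 1
    # The window passed every filter, so the explicit remainder decides.
    return (window in remainder) == (len(cascade) % 2 == 0)


def decode(cascade, remainder, unmapped, reference, read_len):
    """Recover the set of distinct reads by querying reference windows."""
    windows = reference_windows(reference, read_len)
    return {w for w in windows if is_read(w, cascade, remainder)} | set(unmapped)


# Illustrative data: four reads from the reference, one foreign, one duplicate.
reference = "ACGTACGGTCAGTTACGATCGATCGGA"
reads = ["ACGTA", "GTCAG", "TTACG", "CGATC", "NNNNN", "GTCAG"]
cascade, remainder, unmapped = encode(reads, reference, 5)
assert decode(cascade, remainder, unmapped, reference, 5) == set(reads)
```

Under this construction, decoding is exact for any filter sizes, because every false positive at one level is either caught by the next level or listed in the remainder. The filter sizes affect only how much space the cascade and the remainder take.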

CONCLUSIONS

Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. In high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly.
