Compacting de Bruijn graphs from sequencing data quickly and in low memory.

Author information

Chikhi Rayan, Limasset Antoine, Medvedev Paul

Affiliations

CNRS, CRIStAL, Lille, France.

ENS Cachan Brittany, Bruz, France.

Publication information

Bioinformatics. 2016 Jun 15;32(12):i201-i208. doi: 10.1093/bioinformatics/btw279.

DOI:10.1093/bioinformatics/btw279

PMID:27307618

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4908363/

Abstract

MOTIVATION

As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph based algorithms where long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem.
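To make the compaction step concrete, here is a minimal sketch that merges every maximal non-branching path of a small de Bruijn graph into a single string (a unitig). It is not the bcalm 2 algorithm: it assumes a node-centric graph held entirely in memory as a set of k-mers, and it ignores reverse complements. The names `kmers_of` and `compact_unitigs` are hypothetical.

```python
def kmers_of(seq, k):
    """Return the set of all k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def compact_unitigs(kmers):
    """Merge each maximal non-branching path of k-mers into one string.

    Assumption: vertices are k-mers, and there is an edge u -> v whenever
    the (k-1)-suffix of u equals the (k-1)-prefix of v. Strand is ignored.
    """
    kmers = set(kmers)

    def successors(u):
        return [u[1:] + c for c in "ACGT" if u[1:] + c in kmers]

    def predecessors(u):
        return [c + u[:-1] for c in "ACGT" if c + u[:-1] in kmers]

    def right_partner(u):
        # u and v can be merged only if v is u's sole successor
        # and u is v's sole predecessor.
        succ = successors(u)
        if len(succ) == 1 and len(predecessors(succ[0])) == 1:
            return succ[0]
        return None

    def left_partner(u):
        pred = predecessors(u)
        if len(pred) == 1 and len(successors(pred[0])) == 1:
            return pred[0]
        return None

    visited = set()
    unitigs = []
    for start in sorted(kmers):
        if start in visited:
            continue
        # Walk left to the first k-mer of the path; `seen` stops cycles.
        first = start
        seen = {start}
        while True:
            prev = left_partner(first)
            if prev is None or prev in seen:
                break
            first = prev
            seen.add(prev)
        # Walk right, appending one character per merged k-mer.
        unitig = first
        visited.add(first)
        current = first
        while True:
            nxt = right_partner(current)
            if nxt is None or nxt in visited:
                break
            unitig += nxt[-1]
            visited.add(nxt)
            current = nxt
        unitigs.append(unitig)
    return sorted(unitigs)
```

For example, the 3-mers of `ACGTA` and `ACGTC` share the path `ACG -> CGT` and then branch, so the sketch returns three unitigs: `ACGT`, `GTA` and `GTC`. A real tool cannot hold a human-scale k-mer set in one in-memory structure like this, which is why compaction becomes a bottleneck.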

RESULTS

We present an algorithm and a tool, bcalm 2, for the compaction of de Bruijn graphs. bcalm 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for a good balance of memory usage throughout its execution. For human sequencing data, bcalm 2 reduces the computational burden of compacting the de Bruijn graph to roughly an hour and 3 GB of memory. We also applied bcalm 2 to the 22 Gbp loblolly pine and 20 Gbp white spruce sequencing datasets. Compacted graphs were constructed from raw reads in less than 2 days and 40 GB of memory on a single machine. Hence, bcalm 2 is at least an order of magnitude more efficient than other available methods.
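The idea of distributing input by minimizer can be illustrated with a short sketch. It assumes the minimizer of a k-mer is its lexicographically smallest substring of length m, which is only one possible ordering and is not claimed to be the one bcalm 2 uses. The function names and the parameters `k` and `m` are hypothetical.

```python
from collections import defaultdict


def minimizer(kmer, m):
    """Return the lexicographically smallest m-mer inside kmer (assumed ordering)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def partition_by_minimizer(reads, k, m):
    """Group the k-mers of all reads into buckets keyed by their minimizer."""
    buckets = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            buckets[minimizer(kmer, m)].add(kmer)
    return dict(buckets)


# Overlapping k-mers tend to share a minimizer, so they land in the same bucket.
buckets = partition_by_minimizer(["TTACGGTT"], k=5, m=2)
```

Here the first three 5-mers of the read (`TTACG`, `TACGG`, `ACGGT`) all contain the 2-mer `AC` as their smallest substring and fall into one bucket, while `CGGTT` goes to the `CG` bucket. Because neighbouring k-mers usually stay together, each bucket can be processed largely independently and only one bucket's worth of data needs to be in memory per worker. How buckets are later stitched back together is beyond this sketch.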

AVAILABILITY AND IMPLEMENTATION

Source code of bcalm 2 is freely available at: https://github.com/GATB/bcalm

CONTACT

rayan.chikhi@univ-lille1.fr.
