Chikhi Rayan, Limasset Antoine, Medvedev Paul
CNRS, CRIStAL, Lille, France.
ENS Cachan Brittany, Bruz, France.
Bioinformatics. 2016 Jun 15;32(12):i201-i208. doi: 10.1093/bioinformatics/btw279.
As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph based algorithms where long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem.
We present an algorithm and a tool bcalm 2 for the compaction of de Bruijn graphs. bcalm 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for good balance of memory usage throughout its execution. For human sequencing data, bcalm 2 reduces the computational burden of compacting the de Bruijn graph to roughly an hour and 3 GB of memory. We also applied bcalm 2 to the 22 Gbp loblolly pine and 20 Gbp white spruce sequencing datasets. Compacted graphs were constructed from raw reads in less than 2 days and 40 GB of memory on a single machine. Hence, bcalm 2 is at least an order of magnitude more efficient than other available methods.
Source code of bcalm 2 is freely available at: https://github.com/GATB/bcalm
随着每个测序实验的数据量增加,片段组装的挑战在计算方面日益凸显。德布鲁因图是片段组装算法中广泛使用的数据结构,用于表示一组 reads 的信息。压缩是大多数基于德布鲁因图的算法中的一个重要数据缩减步骤,其中长的简单路径被压缩为单个顶点。压缩最近已成为组装流程中的瓶颈,提高其运行时间和内存使用是一个重要问题。
我们提出了一种用于压缩德布鲁因图的算法和工具 bcalm 2。bcalm 2 是一种并行算法,它基于最小化哈希技术分配输入,在整个执行过程中实现良好的内存使用平衡。对于人类测序数据,bcalm 2 将压缩德布鲁因图的计算负担减少到大约一小时和 3GB 内存。我们还将 bcalm 2 应用于 22 Gbp 的火炬松和 20 Gbp 的白云杉测序数据集。在单台机器上,从原始 reads 构建压缩图不到 2 天,内存使用 40GB。因此,bcalm 2 比其他现有方法至少高效一个数量级。
bcalm 2 的源代码可在以下网址免费获取:https://github.com/GATB/bcalm