Suppr超能文献

从测序数据中快速且低内存地压缩德布鲁因图。

Compacting de Bruijn graphs from sequencing data quickly and in low memory.

作者信息

Chikhi Rayan, Limasset Antoine, Medvedev Paul

机构信息

CNRS, CRIStAL, Lille, France.

ENS Cachan Brittany, Bruz, France.

出版信息

Bioinformatics. 2016 Jun 15;32(12):i201-i208. doi: 10.1093/bioinformatics/btw279.

Abstract

MOTIVATION

As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph based algorithms where long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem.

RESULTS

We present an algorithm and a tool bcalm 2 for the compaction of de Bruijn graphs. bcalm 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for good balance of memory usage throughout its execution. For human sequencing data, bcalm 2 reduces the computational burden of compacting the de Bruijn graph to roughly an hour and 3 GB of memory. We also applied bcalm 2 to the 22 Gbp loblolly pine and 20 Gbp white spruce sequencing datasets. Compacted graphs were constructed from raw reads in less than 2 days and 40 GB of memory on a single machine. Hence, bcalm 2 is at least an order of magnitude more efficient than other available methods.

AVAILABILITY AND IMPLEMENTATION

Source code of bcalm 2 is freely available at: https://github.com/GATB/bcalm

CONTACT

rayan.chikhi@univ-lille1.fr.

摘要

动机

随着每个测序实验的数据量增加,片段组装的挑战在计算方面日益凸显。德布鲁因图是片段组装算法中广泛使用的数据结构,用于表示一组 reads 的信息。压缩是大多数基于德布鲁因图的算法中的一个重要数据缩减步骤,其中长的简单路径被压缩为单个顶点。压缩最近已成为组装流程中的瓶颈,提高其运行时间和内存使用是一个重要问题。

结果

我们提出了一种用于压缩德布鲁因图的算法和工具 bcalm 2。bcalm 2 是一种并行算法,它基于最小化哈希技术分配输入,在整个执行过程中实现良好的内存使用平衡。对于人类测序数据,bcalm 2 将压缩德布鲁因图的计算负担减少到大约一小时和 3GB 内存。我们还将 bcalm 2 应用于 22 Gbp 的火炬松和 20 Gbp 的白云杉测序数据集。在单台机器上,从原始 reads 构建压缩图不到 2 天,内存使用 40GB。因此,bcalm 2 比其他现有方法至少高效一个数量级。

可用性和实现

bcalm 2 的源代码可在以下网址免费获取:https://github.com/GATB/bcalm

联系方式

rayan.chikhi@univ-lille1.fr

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1751/4908363/299f6cf31803/btw279f1p.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验