IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.

Affiliation

Department of Computer Science, The University of Hong Kong, Pokfulam Road, Hong Kong.

Publication information

Bioinformatics. 2012 Jun 1;28(11):1420-8. doi: 10.1093/bioinformatics/bts174. Epub 2012 Apr 11.

DOI:10.1093/bioinformatics/bts174

PMID:22495754

Abstract

MOTIVATION

Next-generation sequencing allows us to sequence reads from a microbial environment using single-cell or metagenomic sequencing technologies. However, both technologies suffer from the problem that the sequencing depths of different regions of a genome, or of genomes from different species, are highly uneven. Most existing genome assemblers assume that sequencing depths are even, and as a result they fail to construct correct long contigs.

RESULTS

We introduce IDBA-UD, an algorithm based on the de Bruijn graph approach for assembling reads from single-cell or metagenomic sequencing technologies with uneven sequencing depths. Several non-trivial techniques are employed to tackle these problems. Instead of a single simple threshold, we use multiple depth-relative thresholds to remove erroneous k-mers in both low-depth and high-depth regions. Local assembly with paired-end information is used to solve the branch problem of low-depth short repeat regions. To speed up the process, an error correction step corrects reads from high-depth regions that can be aligned to high-confidence contigs. A comparison of IDBA-UD with existing assemblers (Velvet, Velvet-SC, SOAPdenovo and Meta-IDBA) on different datasets shows that IDBA-UD can reconstruct longer contigs with higher accuracy.
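To make the idea of a depth-relative threshold concrete, the following Python sketch compares each k-mer's depth with the depth of its neighbours in the de Bruijn graph, rather than with one global cutoff. It is an illustration only, not the IDBA-UD implementation: the abstract does not state the exact rule, so the function names, the per-k-mer granularity, the use of the maximum neighbour depth as the "local depth", and the `ratio` parameter are all assumptions made here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (its depth)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def neighbours(kmer, counts):
    """Yield k-mers present in `counts` that overlap `kmer` by k-1 bases."""
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand in counts and cand != kmer:
                yield cand


def filter_depth_relative(counts, ratio=0.2):
    """Keep a k-mer only if its depth is at least `ratio` times its local depth.

    Assumption: local depth is the highest depth among adjacent k-mers;
    a k-mer with no neighbours is compared against itself and kept.
    """
    kept = {}
    for kmer, depth in counts.items():
        local = [counts[n] for n in neighbours(kmer, counts)]
        local_depth = max(local) if local else depth
        if depth >= ratio * local_depth:
            kept[kmer] = depth
    return kept


# Toy depths: a high-depth path (ACG, CGT) with an erroneous branch (CGA),
# and a genuine low-depth path (TTA, TAG) elsewhere in the graph.
toy = {"ACG": 100, "CGT": 100, "CGA": 4, "TTA": 3, "TAG": 3}
kept = filter_depth_relative(toy, ratio=0.2)
```

In this toy input the erroneous branch has depth 4 and the genuine low-depth k-mers have depth 3, so no single global cutoff can remove the former while keeping the latter. The relative test drops `CGA` because it is far shallower than its neighbour `ACG`, and keeps `TTA` and `TAG` because they match their own surroundings.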

AVAILABILITY

The IDBA-UD toolkit is available at our website http://www.cs.hku.hk/~alse/idba_ud
