IEEE/ACM Trans Comput Biol Bioinform. 2019 Jul-Aug;16(4):1091-1106. doi: 10.1109/TCBB.2017.2737999. Epub 2017 Sep 11.
De novo genome assembly describes the process of reconstructing an unknown genome from a large collection of short (or long) reads sequenced from the genome. A single run of a Next-Generation Sequencing (NGS) technology can produce billions of short reads, making genome assembly computationally demanding (both in terms of memory and time). One of the major computational steps in modern day short read assemblers involves the construction and use of a string data structure called the de Bruijn graph. In fact, a majority of short read assemblers build the complete de Bruijn graph for the set of input reads, and subsequently traverse and prune low-quality edges, in order to generate genomic "contigs"-the output of assembly. These steps of graph construction and traversal, contribute to well over 90 percent of the runtime and memory. In this paper, we present a fast algorithm, FastEtch, that uses sketching to build an approximate version of the de Bruijn graph for the purpose of generating an assembly. The algorithm uses Count-Min sketch, which is a probabilistic data structure for streaming data sets. The result is an approximate de Bruijn graph that stores information pertaining only to a selected subset of nodes that are most likely to contribute to the contig generation step. In addition, edges are not stored; instead that fraction which contribute to our contig generation are detected on-the-fly. This approximate approach is intended to significantly improve performance (both execution time and memory footprint) whilst possibly compromising on the output assembly quality. We present two main versions of the assembler-one that generates an assembly, where each contig represents a contiguous genomic region from one strand of the DNA, and another that generates an assembly, where the contigs can straddle either of the two strands of the DNA. For further scalability, we have implemented a multi-threaded parallel code. Experimental results using our algorithm conducted on E. coli, Yeast, C. elegans, and Human (Chr2 and Chr2+3) genomes show that our method yields one of the best time-memory-quality trade-offs, when compared against many state-of-the-art genome assemblers.
从头基因组组装描述了从大量测序的基因组短(或长)读段中重建未知基因组的过程。下一代测序(NGS)技术的单次运行可以产生数十亿个短读段,这使得基因组组装在计算上具有挑战性(无论是在内存还是时间方面)。现代短读段组装器的主要计算步骤之一涉及构建和使用称为 de Bruijn 图的字符串数据结构。事实上,大多数短读段组装器都会为输入读段构建完整的 de Bruijn 图,然后遍历和修剪低质量的边,以生成基因组“拼接体”-组装的输出。图构建和遍历的这些步骤贡献了超过 90%的运行时间和内存。在本文中,我们提出了一种快速算法 FastEtch,它使用草图来构建 de Bruijn 图的近似版本,以生成组装。该算法使用 Count-Min 草图,这是一种用于流数据集的概率数据结构。结果是一个近似的 de Bruijn 图,仅存储与最有可能有助于拼接体生成步骤的选定节点子集相关的信息。此外,不存储边;而是在生成拼接体时实时检测到有助于生成拼接体的那部分边。这种近似方法旨在显著提高性能(执行时间和内存占用),同时可能会影响输出组装质量。我们提出了两种主要的组装器版本:一种生成组装,其中每个拼接体代表 DNA 一条链上的连续基因组区域;另一种生成组装,其中拼接体可以跨越 DNA 的两条链中的任意一条。为了进一步提高可扩展性,我们实现了一个多线程并行代码。使用我们的算法在大肠杆菌、酵母、秀丽隐杆线虫和人类(Chr2 和 Chr2+3)基因组上进行的实验结果表明,与许多最先进的基因组组装器相比,我们的方法在时间-内存-质量折衷方面取得了最佳结果之一。