Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269, USA.
Bioinformatics. 2011 Jul 15;27(14):1901-7. doi: 10.1093/bioinformatics/btr321. Epub 2011 Jun 2.
Exact-match overlap graphs have been broadly used in the context of DNA assembly and the shortest super string problem where the number of strings n ranges from thousands to billions. The length ℓ of the strings is from 25 to 1000, depending on the DNA sequencing technologies. However, many DNA assemblers using overlap graphs suffer from the need for too much time and space in constructing the graphs. It is nearly impossible for these DNA assemblers to handle the huge amount of data produced by the next-generation sequencing technologies where the number n of strings could be several billions. If the overlap graph is explicitly stored, it would require Ω(n(2)) memory, which could be prohibitive in practice when n is greater than a hundred million. In this article, we propose a novel data structure using which the overlap graph can be compactly stored. This data structure requires only linear time to construct and and linear memory to store.
For a given set of input strings (also called reads), we can informally define an exact-match overlap graph as follows. Each read is represented as a node in the graph and there is an edge between two nodes if the corresponding reads overlap sufficiently. A formal description follows. The maximal exact-match overlap of two strings x and y, denoted by ov(max)(x, y), is the longest string which is a suffix of x and a prefix of y. The exact-match overlap graph of n given strings of length ℓ is an edge-weighted graph in which each vertex is associated with a string and there is an edge (x, y) of weight ω=ℓ-|ov(max)(x, y)| if and only if ω ≤ λ, where |ov(max)(x, y)| is the length of ov(max)(x, y) and λ is a given threshold. In this article, we show that the exact-match overlap graphs can be represented by a compact data structure that can be stored using at most (2λ-1)(2⌈logn⌉+⌈logλ⌉)n bits with a guarantee that the basic operation of accessing an edge takes O(log λ) time. We also propose two algorithms for constructing the data structure for the exact-match overlap graph. The first algorithm runs in O(λℓnlogn) worse-case time and requires O(λ) extra memory. The second one runs in O(λℓn) time and requires O(n) extra memory. Our experimental results on a huge amount of simulated data from sequence assembly show that the data structure can be constructed efficiently in time and memory.
Our DNA sequence assembler that incorporates the data structure is freely available on the web at http://www.engr.uconn.edu/~htd06001/assembler/leap.zip
精确匹配重叠图在 DNA 组装和最短超字符串问题的背景下得到了广泛应用,其中字符串的数量 n 从数千到数十亿不等。字符串的长度 ℓ 从 25 到 1000 不等,具体取决于 DNA 测序技术。然而,许多使用重叠图的 DNA 组装器在构建图时需要花费太多的时间和空间。对于下一代测序技术产生的大量数据,这些 DNA 组装器几乎不可能处理,其中字符串的数量 n 可能是数十亿。如果显式存储重叠图,则需要 Ω(n(2))的内存,当 n 大于 1 亿时,这在实践中可能是不可行的。在本文中,我们提出了一种新的数据结构,使用该数据结构可以紧凑地存储重叠图。这种数据结构仅需要线性时间来构建,并且仅需要线性内存来存储。
对于给定的输入字符串集(也称为读取),我们可以非正式地将精确匹配重叠图定义如下。每个读取都表示为图中的一个节点,如果对应的读取重叠足够,则两个节点之间存在边。正式描述如下。两个字符串 x 和 y 的最大精确匹配重叠,记为 ov(max)(x, y),是最长的字符串,它是 x 的后缀和 y 的前缀。给定长度为 ℓ 的 n 个字符串的精确匹配重叠图是一个带权图,其中每个顶点都与一个字符串相关联,如果且仅当 ω ≤ λ 时,存在边 (x, y),权重 ω=ℓ-|ov(max)(x, y)|,其中 |ov(max)(x, y)|是 ov(max)(x, y)的长度,λ 是给定的阈值。在本文中,我们表明可以使用最多 (2λ-1)(2⌈logn⌉+⌈logλ⌉)n 位的紧凑数据结构表示精确匹配重叠图,并保证访问边的基本操作时间为 O(log λ)。我们还提出了两种用于构建精确匹配重叠图数据结构的算法。第一种算法在最坏情况下运行时间为 O(λℓnlogn),需要 O(λ)额外的内存。第二种算法在时间和内存方面都运行 O(λℓn)。我们在来自序列组装的大量模拟数据上的实验结果表明,该数据结构可以高效地构建。
我们的包含该数据结构的 DNA 序列组装器可在 http://www.engr.uconn.edu/~htd06001/assembler/leap.zip 上免费获得。