School of Computer Science, Tel-Aviv University.
Bioinformatics. 2014 Dec 15;30(24):3515-23. doi: 10.1093/bioinformatics/btu578. Epub 2014 Sep 2.
New sequencing technologies generate larger amount of short reads data at decreasing cost. De novo sequence assembly is the problem of combining these reads back to the original genome sequence, without relying on a reference genome. This presents algorithmic and computational challenges, especially for long and repetitive genome sequences. Most existing approaches to the assembly problem operate in the framework of de Bruijn graphs. Yet, a number of recent works use the paradigm of string graph, using a variety of methods for storing and processing suffixes and prefixes, like suffix arrays, the Burrows-Wheeler transform or the FM index. Our work is motivated by a search for new approaches to constructing the string graph, using alternative yet simple data structures and algorithmic concepts.
We introduce a novel hash-based method for constructing the string graph. We use incremental hashing, and specifically a modification of the Karp-Rabin fingerprint, and Bloom filters. Using these probabilistic methods might create false-positive and false-negative edges during the algorithm's execution, but these are all detected and corrected. The advantages of the proposed approach over existing methods are its simplicity and the incorporation of established probabilistic techniques in the context of de novo genome sequencing. Our preliminary implementation is favorably comparable with the first string graph construction of Simpson and Durbin (2010) (but not with subsequent improvements). Further research and optimizations will hopefully enable the algorithm to be incorporated, with noticeable performance improvement, in state-of-the-art string graph-based assemblers.
新的测序技术以较低的成本生成了更多的短读段数据。从头序列组装是将这些读段重新组合回原始基因组序列的问题,而不依赖于参考基因组。这带来了算法和计算方面的挑战,特别是对于长的和重复的基因组序列。现有的大多数组装问题的方法都在 de Bruijn 图的框架内运行。然而,一些最近的工作使用字符串图的范例,使用各种方法来存储和处理后缀和前缀,如后缀数组、Burrows-Wheeler 变换或 FM 索引。我们的工作是受到寻找构建字符串图的新方法的启发,使用替代但简单的数据结构和算法概念。
我们引入了一种新颖的基于哈希的方法来构建字符串图。我们使用增量哈希,特别是 Karp-Rabin 指纹和布隆过滤器的修改。在算法执行过程中,使用这些概率方法可能会创建假阳性和假阴性边,但这些边都被检测和纠正。与现有方法相比,所提出的方法的优点是其简单性以及在从头测序的背景下结合了已建立的概率技术。我们的初步实现与 Simpson 和 Durbin(2010 年)的第一个字符串图构建(但与随后的改进不同)相当。进一步的研究和优化有望使该算法能够以显著的性能提升,被纳入基于字符串图的最新组装器中。