Department of Computer Science, Stony Brook University, Stonybrook, NY, USA.
Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.
Indexing reference sequences for search-both individual genomes and collections of genomes-is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large.
We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.
pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.
Supplementary data are available at Bioinformatics online.
为了搜索个体基因组和基因组集合,对参考序列进行索引是许多序列分析任务的重要基础。许多工作都致力于基于后缀数组、BWT 和 FM-index 等数据结构为基因组序列开发全文索引。然而,由于其能够使用图形结构表示多个参考序列以及压缩高度重复的序列区域的自然能力,最近 de Bruijn 图作为索引数据结构引起了关注。然而,关于如何最好地索引这种结构以有效地进行查询并且随着要索引的参考序列的大小和数量的增加而保持实用的内存使用,人们关注得较少。
我们提出了一种用于表示和索引压缩彩色 de Bruijn 图的新颖数据结构,该结构允许有效地进行模式匹配和检索与每个 k-mer 相关联的参考信息。随着 de Bruijn 图作为索引的普及度在过去几年中增加,对这种结构的表示形式的提议数量也增加了。现有的结构通常分为两类;一类是基于哈希的,提供对基础 k-mer 信息的快速访问,另一类是节省空间的,提供渐近有效的但实际上较慢的模式搜索。我们的表示形式在这两个极端之间取得了折衷。通过构建基于最小完美哈希并在适用的地方使用简洁表示,我们的数据结构在大大减少与传统基于哈希的实现相比的空间的同时提供了实际快速的查找。此外,我们描述了此索引的抽样方案,该方案提供了在查询速度和索引大小减少之间进行权衡的能力。我们相信这种表示形式在速度和空间使用之间取得了理想的平衡,并允许对大型参考序列进行快速搜索。最后,我们描述了此索引在分类阅读分配问题中的应用。我们表明,通过采用本质上类似于 Kraken 的方法,但用链一致的最大匹配的覆盖来代替 k-mer 的存在,我们可以提高分类阅读分配的空间、速度和准确性。
pufferfish 是用 C++11 编写的,是开源的,并可在 https://github.com/COMBINE-lab/pufferfish 上获得。
补充数据可在《生物信息学》在线获得。