Suppr
超能文献

A space and time-efficient index for the compacted colored de Bruijn graph.

Affiliations

Department of Computer Science, Stony Brook University, Stony Brook, NY, USA.

Publication information

Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.

DOI:10.1093/bioinformatics/bty292

PMID:29949982

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6022659/

Abstract

MOTIVATION

Indexing reference sequences for search, both individual genomes and collections of genomes, is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly repetitive sequence regions. Yet much less attention has been given to how best to index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grow large.

RESULTS

We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.
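The general idea of a hash-based k-mer lookup over unitigs can be illustrated with a small sketch. This is not the pufferfish implementation: the function names `build_index` and `lookup` are hypothetical, a Python dict stands in for the minimum perfect hash, only the forward strand is handled, and a sorted list with binary search replaces a succinct boundary structure. It assumes that each k-mer occurs once across the unitigs, and that a minimum perfect hash returns some slot even for absent keys, so the result has to be verified against the stored sequence.

```python
from bisect import bisect_right


def build_index(unitigs, k):
    """Build a toy dense k-mer index over a list of unitig strings."""
    seq = "".join(unitigs)  # concatenated unitig sequence
    starts = []             # starts[i] = offset of unitig i inside seq
    offset = 0
    for u in unitigs:
        starts.append(offset)
        offset += len(u)

    slot = {}  # stand-in for a minimum perfect hash: k-mer -> slot in [0, N)
    pos = []   # pos[slot] = offset of that k-mer inside seq
    for i, u in enumerate(unitigs):
        for j in range(len(u) - k + 1):
            kmer = u[j:j + k]
            if kmer not in slot:
                slot[kmer] = len(pos)
                pos.append(starts[i] + j)
    return {"k": k, "seq": seq, "starts": starts, "slot": slot, "pos": pos}


def lookup(index, kmer):
    """Return (unitig id, offset within unitig) for kmer, or None if absent."""
    k = index["k"]
    n = len(index["pos"])
    if len(kmer) != k or n == 0:
        return None
    # Mimic a perfect hash: absent keys still land on an arbitrary valid slot.
    s = index["slot"].get(kmer, sum(map(ord, kmer)) % n)
    p = index["pos"][s]
    # Verification step: compare the query against the indexed sequence.
    if index["seq"][p:p + k] != kmer:
        return None
    u = bisect_right(index["starts"], p) - 1  # unitig containing offset p
    return u, p - index["starts"][u]


index = build_index(["ACGTAC", "TTGCA"], 3)
hit = lookup(index, "TGC")   # k-mer inside the second unitig
miss = lookup(index, "ACT")  # spans the unitig boundary, so it is not indexed
```

Because the hash itself stores no keys, the comparison against the concatenated sequence is what rejects absent k-mers, including strings that only appear across a unitig boundary in the concatenation. Mapping each unitig to its occurrences in the reference sequences would be a separate table keyed by the unitig id returned here.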

AVAILABILITY AND IMPLEMENTATION

pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/911d/6022659/e394be79399b/bty292f1.jpg
