使用德布鲁因图搜索实现高维颜色信息的高效、可扩展且精确表示。

An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search.

作者信息

Almodaresi Fatemeh, Pandey Prashant, Ferdman Michael, Johnson Rob, Patro Rob

机构信息

Department of Computer Science, University of Maryland, College Park, Maryland.

School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.

出版信息

J Comput Biol. 2020 Apr;27(4):485-499. doi: 10.1089/cmb.2019.0322. Epub 2020 Mar 16.

DOI:10.1089/cmb.2019.0322

PMID:32176522

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC7185321/

Abstract

The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes-patterns of color occurrence-present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into thousands. We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved >11 × better compression compared to Ramen, Ramen, Rao (RRR).

摘要

彩色德布鲁因图（cdbg）及其变体已成为基因组学众多领域中使用的一种重要组合结构，例如宏基因组样本中的群体水平变异检测、大规模序列搜索以及基于cdbg的参考序列索引。随着样本或基因组被添加到cdbg中，颜色信息开始主导表示此数据结构所需的空间。在本文中，我们展示了如何通过采用分层编码来高效表示颜色信息，该编码利用了德布鲁因图（dbg）中颜色类之间的相关性——颜色出现的模式。在推导利用此类相关性的颜色信息高效编码时，一个主要挑战是确定在可能的颜色模式的高维空间中哪些颜色类彼此接近。我们证明dbg本身可以用作在此空间中搜索近似最近邻的有效机制。虽然我们的方法即使对于相对较小的cdbg（数百个实验）也能减少颜色信息的编码大小，但随着潜在颜色的数量（即样本或参考）增长到数千个，收益尤为显著。我们在两种不同的应用场景中应用此编码；用于大规模序列搜索索引Mantis的隐式cdbg，以及Vari和Rainbowfish等工具在群体水平变异检测中使用的颜色信息编码。我们的结果表明，颜色信息表示的整体大小和可扩展性有显著改进。在我们对10000个样本的实验中，与Ramen、Rao（RRR）相比，我们实现了超过11倍的更好压缩。

相似文献

An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search.

J Comput Biol. 2020 Apr;27(4):485-499. doi: 10.1089/cmb.2019.0322. Epub 2020 Mar 16.

Building large updatable colored de Bruijn graphs via merging.

Bioinformatics. 2019 Jul 15;35(14):i51-i60. doi: 10.1093/bioinformatics/btz350.

A space and time-efficient index for the compacted colored de Bruijn graph.

Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.

Simplitigs as an efficient and scalable representation of de Bruijn graphs.

Genome Biol. 2021 Apr 6;22(1):96. doi: 10.1186/s13059-021-02297-z.

deGSM: Memory Scalable Construction Of Large Scale de Bruijn Graph.

IEEE/ACM Trans Comput Biol Bioinform. 2021 Nov-Dec;18(6):2157-2166. doi: 10.1109/TCBB.2019.2913932. Epub 2021 Dec 8.

Integrating long-range connectivity information into de Bruijn graphs.

Bioinformatics. 2018 Aug 1;34(15):2556-2565. doi: 10.1093/bioinformatics/bty157.

Meta-colored compacted de Bruijn graphs.

bioRxiv. 2023 Nov 1:2023.07.21.550101. doi: 10.1101/2023.07.21.550101.

Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.

J Comput Biol. 2024 Oct;31(10):1022-1044. doi: 10.1089/cmb.2024.0714. Epub 2024 Oct 9.

Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs.

BMC Bioinformatics. 2010 Nov 15;11:560. doi: 10.1186/1471-2105-11-560.

Enhanced Compression of -Mer Sets with Counters via de Bruijn Graphs.

J Comput Biol. 2024 Jun;31(6):524-538. doi: 10.1089/cmb.2024.0530. Epub 2024 May 31.

引用本文的文献

Movi Color: fast and accurate long-read classification with the move structure.

bioRxiv. 2025 May 27:2025.05.22.655637. doi: 10.1101/2025.05.22.655637.

Fast and Scalable Parallel External-Memory Construction of Colored Compacted de Bruijn Graphs with Cuttlefish 3.

bioRxiv. 2025 Feb 6:2025.02.02.636161. doi: 10.1101/2025.02.02.636161.

Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.

J Comput Biol. 2024 Oct;31(10):1022-1044. doi: 10.1089/cmb.2024.0714. Epub 2024 Oct 9.

Compression algorithm for colored de Bruijn graphs.

Algorithms Mol Biol. 2024 May 26;19(1):20. doi: 10.1186/s13015-024-00254-6.

Compression Algorithm for Colored de Bruijn Graphs.

Lebniz Int Proc Inform. 2023 Sep;273. doi: 10.4230/LIPIcs.WABI.2023.17. Epub 2023 Aug 29.

Fulgor: a fast and compact k-mer index for large-scale matching and color queries.

Algorithms Mol Biol. 2024 Jan 22;19(1):3. doi: 10.1186/s13015-024-00251-9.

Hierarchical Interleaved Bloom Filter: enabling ultrafast, approximate sequence queries.

Genome Biol. 2023 May 31;24(1):131. doi: 10.1186/s13059-023-02971-4.

Fulgor: A fast and compact -mer index for large-scale matching and color queries.

bioRxiv. 2023 May 20:2023.05.09.539895. doi: 10.1101/2023.05.09.539895.

An incrementally updatable and scalable system for large-scale sequence search using the Bentley-Saxe transformation.

Bioinformatics. 2022 Jun 13;38(12):3155-3163. doi: 10.1093/bioinformatics/btac142.

VariantStore: an index for large-scale genomic variant search.

Genome Biol. 2021 Aug 19;22(1):231. doi: 10.1186/s13059-021-02442-8.

本文引用的文献

Ultrafast search of all deposited bacterial and viral genomic data.

Nat Biotechnol. 2019 Feb;37(2):152-159. doi: 10.1038/s41587-018-0010-1. Epub 2019 Feb 4.

SeqOthello: querying RNA-seq experiments at scale.

Genome Biol. 2018 Oct 19;19(1):167. doi: 10.1186/s13059-018-1535-9.

Dynamic compression schemes for graph coloring.

Bioinformatics. 2019 Feb 1;35(3):407-414. doi: 10.1093/bioinformatics/bty632.

A space and time-efficient index for the compacted colored de Bruijn graph.

Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.

Practical dynamic de Bruijn graphs.

Bioinformatics. 2018 Dec 15;34(24):4189-4195. doi: 10.1093/bioinformatics/bty500.

Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index.

Cell Syst. 2018 Aug 22;7(2):201-207.e4. doi: 10.1016/j.cels.2018.05.021. Epub 2018 Jun 20.

Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees.

J Comput Biol. 2018 Jul;25(7):755-765. doi: 10.1089/cmb.2017.0265. Epub 2018 Mar 12.

AllSome Sequence Bloom Trees.

J Comput Biol. 2018 May;25(5):467-479. doi: 10.1089/cmb.2017.0258. Epub 2018 Apr 5.

Integrating long-range connectivity information into de Bruijn graphs.

Bioinformatics. 2018 Aug 1;34(15):2556-2565. doi: 10.1093/bioinformatics/bty157.

Squeakr: an exact and approximate k-mer counting system.

Bioinformatics. 2018 Feb 15;34(4):568-575. doi: 10.1093/bioinformatics/btx636.

文献AI研究员

20分钟写一篇综述，助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型，支持多种主流文档格式。

立即体验

使用德布鲁因图搜索实现高维颜色信息的高效、可扩展且精确表示。

An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search.

作者信息

机构信息

出版信息

相似文献

引用本文的文献

本文引用的文献

文献AI研究员

用中文搜PubMed

文档翻译

Suppr 超能文献