An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search.

Authors

Almodaresi Fatemeh, Pandey Prashant, Ferdman Michael, Johnson Rob, Patro Rob

Affiliations

Department of Computer Science, University of Maryland, College Park, Maryland.

School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.

Publication information

J Comput Biol. 2020 Apr;27(4):485-499. doi: 10.1089/cmb.2019.0322. Epub 2020 Mar 16.

Abstract

The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure in many areas of genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among the color classes (patterns of color occurrence) present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into the thousands. We apply this encoding in two different applications: the implicit cdbg used for a large-scale sequence search index, Mantis, and the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of the representation of the color information. In our experiment on 10,000 samples, we achieved >11× better compression than Raman, Raman, and Rao (RRR) encoding.
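
The abstract does not spell out the encoding, so the following Python sketch only illustrates the general idea under stated assumptions: each color class is a bit vector (bit i set means sample i contains the k-mers of that class), pairs of classes that label adjacent k-mers in the dbg serve as candidate near neighbors, and each class is stored as the list of bit positions in which it differs from a parent class. The tree is chosen with Prim's algorithm from a virtual all-zero root. The function names, the integer bit-vector layout, and the toy data are hypothetical and are not taken from Mantis, Vari, or Rainbowfish.

```python
import heapq
from collections import defaultdict


def differing_bits(a, b):
    """Return the bit positions at which color vectors a and b differ."""
    x, i, out = a ^ b, 0, []
    while x:
        if x & 1:
            out.append(i)
        x >>= 1
        i += 1
    return out


def build_delta_tree(colors, candidate_edges):
    """Pick a parent for every color class and store only the delta to it.

    colors: list of ints used as bit vectors (assumed layout).
    candidate_edges: (u, v) pairs of class ids that label adjacent k-mers
        in the dbg; these stand in for the approximate nearest neighbors.
    Returns (parent, deltas); parent -1 is the virtual all-zero root.
    """
    adj = defaultdict(list)
    for u, v in candidate_edges:
        w = bin(colors[u] ^ colors[v]).count("1")  # Hamming distance
        adj[u].append((w, v))
        adj[v].append((w, u))

    parent = [None] * len(colors)
    deltas = [None] * len(colors)
    # Prim's algorithm: the root can reach every class at cost popcount.
    heap = [(bin(c).count("1"), i, -1) for i, c in enumerate(colors)]
    heapq.heapify(heap)
    while heap:
        _, v, p = heapq.heappop(heap)
        if parent[v] is not None:
            continue  # already attached through a cheaper edge
        parent[v] = p
        base = 0 if p == -1 else colors[p]
        deltas[v] = differing_bits(base, colors[v])
        for w, u in adj[v]:
            if parent[u] is None:
                heapq.heappush(heap, (w, u, v))
    return parent, deltas


def decode(v, parent, deltas):
    """Rebuild a color vector by XOR-ing the deltas on the path to the root."""
    vec = 0
    while v != -1:
        for i in deltas[v]:
            vec ^= 1 << i
        v = parent[v]
    return vec


# Toy data: four color classes over four samples, two dbg adjacencies.
colors = [0b1111, 0b1110, 0b0110, 0b1000]
edges = [(0, 1), (1, 2)]
parent, deltas = build_delta_tree(colors, edges)
stored_bits = sum(len(d) for d in deltas)
raw_bits = sum(bin(c).count("1") for c in colors)
```

On this toy input the delta lists hold fewer entries than the set bits of the raw vectors, and `decode` recovers every class exactly, which is the sense in which such a representation stays exact while shrinking as neighboring classes become more similar.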
