Almodaresi Fatemeh, Pandey Prashant, Ferdman Michael, Johnson Rob, Patro Rob
Department of Computer Science, University of Maryland, College Park, Maryland.
School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania.
J Comput Biol. 2020 Apr;27(4):485-499. doi: 10.1089/cmb.2019.0322. Epub 2020 Mar 16.
The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes-patterns of color occurrence-present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into thousands. We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved >11 × better compression compared to Ramen, Ramen, Rao (RRR).
彩色德布鲁因图(cdbg)及其变体已成为基因组学众多领域中使用的一种重要组合结构,例如宏基因组样本中的群体水平变异检测、大规模序列搜索以及基于cdbg的参考序列索引。随着样本或基因组被添加到cdbg中,颜色信息开始主导表示此数据结构所需的空间。在本文中,我们展示了如何通过采用分层编码来高效表示颜色信息,该编码利用了德布鲁因图(dbg)中颜色类之间的相关性——颜色出现的模式。在推导利用此类相关性的颜色信息高效编码时,一个主要挑战是确定在可能的颜色模式的高维空间中哪些颜色类彼此接近。我们证明dbg本身可以用作在此空间中搜索近似最近邻的有效机制。虽然我们的方法即使对于相对较小的cdbg(数百个实验)也能减少颜色信息的编码大小,但随着潜在颜色的数量(即样本或参考)增长到数千个,收益尤为显著。我们在两种不同的应用场景中应用此编码;用于大规模序列搜索索引Mantis的隐式cdbg,以及Vari和Rainbowfish等工具在群体水平变异检测中使用的颜色信息编码。我们的结果表明,颜色信息表示的整体大小和可扩展性有显著改进。在我们对10000个样本的实验中,与Ramen、Rao(RRR)相比,我们实现了超过11倍的更好压缩。