Suppr超能文献

模式所在:带重复感知的彩色 de Bruijn 图压缩。

Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.

机构信息

DAIS, Ca' Foscari University of Venice, Venice, Italy.

ISTI-CNR, Pisa, Italy.

出版信息

J Comput Biol. 2024 Oct;31(10):1022-1044. doi: 10.1089/cmb.2024.0714. Epub 2024 Oct 9.

Abstract

We describe lossless compressed data structures for the de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from -mers to their . The color set of a -mer is the set of all identifiers, or , of the references that contain the -mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage on the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.

摘要

我们描述了用于 de Bruijn 图(或 c-dBG)的无损压缩数据结构。给定一组参考序列,c-dBG 可以被视为一个从 -mers 到它们的位置的映射。-mer 的颜色集是包含该 -mer 的所有标识符(或标签)的集合。虽然这些映射在计算生物学中有着无数的应用(例如基本查询、读映射、丰度估计等),但其内存使用对于大规模序列索引来说是一个严重的挑战。我们的解决方案利用了索引大量相关基因组时颜色集的内在重复性。因此,所描述的算法将颜色集分解为在整个集合中重复的模式,并一次表示这些模式,而不是像将集合编码为原子整数列表那样重复表示。在一系列数据集和查询工作负载上的实验结果表明,这些表示在空间效率方面大大优于之前最好的解决方案(有时甚至显著地,索引大小缩小了一个数量级)。尽管空间减少了,但与最快的索引相比,这些索引对查询的效率只有适度的影响。

相似文献

4
Meta-colored compacted de Bruijn graphs.元彩色压缩德布鲁因图
bioRxiv. 2023 Nov 1:2023.07.21.550101. doi: 10.1101/2023.07.21.550101.

本文引用的文献

3
Investigating the complexity of the double distance problems.研究双距离问题的复杂性。
Algorithms Mol Biol. 2024 Jan 4;19(1):1. doi: 10.1186/s13015-023-00246-y.
5
On weighted k-mer dictionaries.关于加权k-元字典。
Algorithms Mol Biol. 2023 Jun 17;18(1):3. doi: 10.1186/s13015-023-00226-2.
6
Sparse and skew hashing of K-mers.K- -mer 的稀疏和偏斜哈希。
Bioinformatics. 2022 Jun 24;38(Suppl 1):i185-i194. doi: 10.1093/bioinformatics/btac245.
7
Lossless indexing with counting de Bruijn graphs.基于计数型 de Bruijn 图的无损索引
Genome Res. 2022 Sep 27;32(9):1754-1764. doi: 10.1101/gr.276607.122.

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验