Compression Algorithm for Colored de Bruijn Graphs.

Authors

Rahman Amatur, Dufresne Yoann, Medvedev Paul

Affiliations

Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.

Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France.

Publication information

Leibniz Int Proc Inform. 2023 Sep;273. doi: 10.4230/LIPIcs.WABI.2023.17. Epub 2023 Aug 29.

DOI:10.4230/LIPIcs.WABI.2023.17
PMID:38712341
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11071130/
Abstract

A colored de Bruijn graph (also called a set of k-mer sets) is a set of k-mers with every k-mer assigned a set of colors. Colored de Bruijn graphs are used in a variety of applications, including variant calling, genome assembly, and database search. However, their size has posed a scalability challenge to algorithm developers and users. Numerous indexing data structures have been proposed that store the graph compactly while supporting fast query operations. However, disk compression algorithms, which do not need to support queries on the compressed data and can thus be more space-efficient, have received little attention. The dearth of specialized compression tools has been a detriment to tool developers, tool users, and reproducibility efforts. In this paper, we develop a new tool that compresses colored de Bruijn graphs to disk, building on previous ideas for compression of k-mer sets and indexing colored de Bruijn graphs. We test our tool, called ESS-color, on various datasets, including both sequencing data and whole genomes. ESS-color achieves better compression than all evaluated tools on all datasets, with no other tool able to consistently achieve less than 44% space overhead.
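
The definition in the first sentence of the abstract can be made concrete with a small sketch. The code below is an illustration only, not the ESS-color method: it builds an uncompressed colored k-mer set as a plain dictionary. The function name `colored_kmer_sets`, the sample names, and the toy sequences are all hypothetical, and the sketch makes the simplifying assumption that a k-mer and its reverse complement are distinct, which many k-mer tools do not.

```python
from collections import defaultdict


def colored_kmer_sets(samples, k):
    """Map each k-mer to the set of colors (sample names) that contain it.

    `samples` maps a color name to a list of sequences. Hypothetical helper
    for illustration; reverse complements are NOT merged here.
    """
    colors = defaultdict(set)
    for color, sequences in samples.items():
        for seq in sequences:
            # Slide a window of length k across the sequence.
            for i in range(len(seq) - k + 1):
                colors[seq[i:i + k]].add(color)
    return dict(colors)


# Toy input (assumed, not from the paper): two samples, one sequence each.
samples = {
    "sample_A": ["ACGTAC"],
    "sample_B": ["CGTACG"],
}

graph = colored_kmer_sets(samples, k=4)
for kmer in sorted(graph):
    print(kmer, sorted(graph[kmer]))
```

With this toy input, the k-mers `CGTA` and `GTAC` carry both colors, while `ACGT` and `TACG` carry one each. Storing every k-mer together with its full color set in this way is the kind of representation whose size motivates the specialized compression described in the abstract.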

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b17/11071130/3120109ac3ab/nihms-1985036-f0001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3b17/11071130/4af5ab968016/nihms-1985036-f0002.jpg

Similar articles

1
Compression Algorithm for Colored de Bruijn Graphs.
Leibniz Int Proc Inform. 2023 Sep;273. doi: 10.4230/LIPIcs.WABI.2023.17. Epub 2023 Aug 29.
2
Compression algorithm for colored de Bruijn graphs.
Algorithms Mol Biol. 2024 May 26;19(1):20. doi: 10.1186/s13015-024-00254-6.
3
Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.
J Comput Biol. 2024 Oct;31(10):1022-1044. doi: 10.1089/cmb.2024.0714. Epub 2024 Oct 9.
4
Disk compression of k-mer sets.
Algorithms Mol Biol. 2021 Jun 21;16(1):10. doi: 10.1186/s13015-021-00192-7.
5
Lossless indexing with counting de Bruijn graphs.
Genome Res. 2022 Sep 27;32(9):1754-1764. doi: 10.1101/gr.276607.122.
6
Where the patterns are: repetition-aware compression for colored de Bruijn graphs.
bioRxiv. 2024 Jul 13:2024.07.09.602727. doi: 10.1101/2024.07.09.602727.
7
Meta-colored compacted de Bruijn graphs.
bioRxiv. 2023 Nov 1:2023.07.21.550101. doi: 10.1101/2023.07.21.550101.
8
Fulgor: A fast and compact k-mer index for large-scale matching and color queries.
bioRxiv. 2023 May 20:2023.05.09.539895. doi: 10.1101/2023.05.09.539895.
9
Simplitigs as an efficient and scalable representation of de Bruijn graphs.
Genome Biol. 2021 Apr 6;22(1):96. doi: 10.1186/s13059-021-02297-z.
10
Fulgor: a fast and compact k-mer index for large-scale matching and color queries.
Algorithms Mol Biol. 2024 Jan 22;19(1):3. doi: 10.1186/s13015-024-00251-9.
