The backpack quotient filter: A dynamic and space-efficient data structure for querying k-mers with abundance.

Authors

Levallois Victor, Andreace Francesco, Le Gal Bertrand, Dufresne Yoann, Peterlongo Pierre

Affiliations

University Rennes, Inria, CNRS, IRISA - UMR 6074, 35000 Rennes, France.

Department of Computational Biology, Institut Pasteur, Université Paris Cité, 75015 Paris, France.

Publication information

iScience. 2024 Nov 23;27(12):111435. doi: 10.1016/j.isci.2024.111435. eCollection 2024 Dec 20.

DOI:10.1016/j.isci.2024.111435
PMID:39720533
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11667073/
Abstract

Genomic data sequencing is crucial for understanding biological systems. As genomic databases such as the European Nucleotide Archive expand exponentially, efficient data manipulation is essential. A key challenge is querying these databases to determine whether specific sequences are present and how abundant they are within datasets. This paper presents the Backpack Quotient Filter (BQF), a data structure for indexing k-mers (substrings of length k) that is more space-efficient than the Counting Quotient Filter (CQF). The BQF maintains essential features such as abundance information and dynamicity, with an extremely low false positive rate of less than [value missing from this copy]. Our method redefines abundance information handling and implements an independent strategy for space efficiency. The BQF uses four times less space than the CQF on complex datasets such as sea-water metagenomics sequences. Additionally, its space efficiency improves with larger datasets, addressing the need for scalable data solutions.
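To make the vocabulary of the abstract concrete, the following is a minimal sketch of the general quotienting idea behind quotient-filter-style structures: a k-mer is hashed to a fingerprint, the high bits (the quotient) select a slot, and only the low bits (the remainder) are stored next to an abundance counter. This is not the BQF or CQF layout described in the paper. The class name `ToyQuotientTable`, the helper `kmers`, the choice of BLAKE2b as hash, and the per-slot dictionaries are all assumptions made for illustration; real quotient filters pack remainders into a flat bit array with metadata bits instead.

```python
import hashlib


def kmers(seq, k):
    """Yield every substring of length k of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class ToyQuotientTable:
    """Toy quotienting table with counts (hypothetical, NOT the BQF/CQF layout)."""

    def __init__(self, q_bits, r_bits):
        self.q_bits = q_bits
        self.r_bits = r_bits
        # One small dict per quotient: remainder -> abundance.
        self.slots = [dict() for _ in range(1 << q_bits)]

    def _split(self, kmer):
        # Hash the k-mer to 64 bits, keep q_bits + r_bits of them,
        # then split into (quotient, remainder).
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        h &= (1 << (self.q_bits + self.r_bits)) - 1
        return h >> self.r_bits, h & ((1 << self.r_bits) - 1)

    def insert(self, kmer, count=1):
        q, r = self._split(kmer)
        self.slots[q][r] = self.slots[q].get(r, 0) + count

    def query(self, kmer):
        # Returns 0 if absent. Two distinct k-mers with the same fingerprint
        # share a counter, so a result can over-estimate but never under-estimate.
        q, r = self._split(kmer)
        return self.slots[q].get(r, 0)


if __name__ == "__main__":
    table = ToyQuotientTable(q_bits=8, r_bits=24)
    for km in kmers("ACGTACGTAC", 4):
        table.insert(km)
    print(table.query("ACGT"))  # at least 2: "ACGT" occurs twice in the sequence
```

Because only the remainder is stored, the false positive rate of such a table depends on the fingerprint width: with more remainder bits, fewer distinct k-mers collide, at the cost of more space per entry.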


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2bf7/11667073/04824f0221f1/fx2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2bf7/11667073/15a1acc28f13/fx1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2bf7/11667073/9fdbc92261ab/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2bf7/11667073/0746883352f2/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2bf7/11667073/cc163176107f/gr3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2bf7/11667073/5d39b54623fc/gr4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2bf7/11667073/cd8c3ee5609a/gr5.jpg

