
Kaminari: a resource-frugal index for approximate colored k-mer queries.

Authors

Levallois Victor, Shibuya Yoshihiro, Le Gal Bertrand, Patro Rob, Peterlongo Pierre, Ermanno Pibiri Giulio

Affiliations

GenScale, University of Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, France.

Sequence Bioinformatics Unit, Institut Pasteur, F-75015, Paris, France.

Publication information

bioRxiv. 2025 May 21:2025.05.16.654317. doi: 10.1101/2025.05.16.654317.

DOI:10.1101/2025.05.16.654317
PMID:40475623
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12139927/
Abstract

MOTIVATION

The problem of identifying the set of textual documents from a given database that contain a query string has been studied in various fields of computing, e.g., in Information Retrieval, Databases, and Computational Biology. We consider the approximate version of this problem, that is, the result set is allowed to contain some false positive matches (but no false negatives), and focus on the specific case where the indexed documents are DNA strings. In this setting, state-of-the-art solutions rely on Bloom filters as a way to index all k-mers (substrings of length k) in the documents. To answer a query, the k-mers of the query string are tested for membership against the index, and documents that contain at least a user-prescribed fraction of them (e.g., 75-80%) are returned.
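The Bloom-filter baseline described above can be illustrated with a minimal Python sketch. It is not taken from any of the tools the paper compares against: the class `BloomFilter`, the functions `build_index` and `query`, the filter size, and the number of hash functions are all hypothetical choices made for illustration, and reverse complements are ignored for brevity. It builds one filter per document and reports every document for which at least a fraction `threshold` of the query's k-mers test positive.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        # May return True for an item never added (false positive),
        # but never returns False for an item that was added.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def kmers(seq, k):
    """All substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(docs, k, num_bits=1 << 16, num_hashes=3):
    """Map each document name to a Bloom filter holding its k-mers."""
    index = {}
    for name, seq in docs.items():
        bf = BloomFilter(num_bits, num_hashes)
        for km in kmers(seq, k):
            bf.add(km)
        index[name] = bf
    return index


def query(index, q, k, threshold=0.8):
    """Return documents containing at least `threshold` of the query's k-mers."""
    q_kmers = kmers(q, k)
    if not q_kmers:
        return []
    hits = []
    for name, bf in index.items():
        found = sum(km in bf for km in q_kmers)
        if found / len(q_kmers) >= threshold:
            hits.append(name)
    return hits


docs = {
    "A": "ACGTACGTTAGCCGATAGGCTAAC",
    "B": "TTTTGGGGCCCCAAAATTTTGGGG",
}
index = build_index(docs, k=5)
result = query(index, "ACGTTAGCCGATAGG", k=5)
```

Because a Bloom filter never reports an inserted item as absent, document `A`, which contains the query as a substring, is always returned; other documents can appear only through false positives.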

METHODS AND RESULTS

Here, we explore an alternative index design based on k-mer minimizers and integer compression methods. We show that a careful implementation of this design outperforms previous solutions based on Bloom filters by a wide margin: the index has lower memory footprint and faster query times, while false positive matches have only a minor impact on the ranking of the documents reported. This trend is robust across genomic datasets of different complexity and query workloads.
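The abstract does not spell out the data structure, so the following Python sketch only illustrates the two ingredients it names, under stated assumptions. It assumes that each minimizer is mapped to the sorted list of documents in which it occurs, and that these lists are stored as delta-encoded variable-length integers; the names `minimizer`, `encode_ids`, `decode_ids`, `build_index`, and `query`, the hash function, and the parameters `k` and `m` are hypothetical and are not taken from the Kaminari code base.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k):
    """All substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(kmer, m):
    """The m-mer of `kmer` with the smallest hash (leftmost on ties)."""
    def h(s):
        return hashlib.blake2b(s.encode(), digest_size=8).digest()
    return min((kmer[i:i + m] for i in range(len(kmer) - m + 1)), key=h)


def encode_ids(ids):
    """Delta-encode a sorted id list and write each gap as a 7-bit varint."""
    out = bytearray()
    prev = 0
    for x in ids:
        gap = x - prev
        prev = x
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # high bit set: more bytes follow
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_ids(data):
    """Inverse of encode_ids."""
    ids, cur, shift, prev = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += cur
            ids.append(prev)
            cur, shift = 0, 0
    return ids


def build_index(docs, k, m):
    """Map each minimizer to the compressed list of documents containing it."""
    sets = defaultdict(set)
    for doc_id, seq in enumerate(docs):
        for km in kmers(seq, k):
            sets[minimizer(km, m)].add(doc_id)
    return {mz: encode_ids(sorted(ids)) for mz, ids in sets.items()}


def query(index, q, k, m, num_docs, threshold=0.8):
    """Return ids of documents credited with at least `threshold` of the query's k-mers."""
    q_kmers = kmers(q, k)
    if not q_kmers:
        return []
    counts = [0] * num_docs
    for km in q_kmers:
        # A k-mer is credited to every document that contains its minimizer.
        for doc_id in decode_ids(index.get(minimizer(km, m), b"")):
            counts[doc_id] += 1
    return [d for d, c in enumerate(counts) if c / len(q_kmers) >= threshold]


docs = ["ACGTACGTTAGCCGATAGGCTAAC", "TTTTGGGGCCCCAAAATTTTGGGG"]
index = build_index(docs, k=7, m=4)
result = query(index, "ACGTTAGCCGATAGG", k=7, m=4, num_docs=len(docs))
```

In this sketch a k-mer present in a document always has its minimizer recorded for that document, so there are no false negatives. False positives arise when a different k-mer in another document shares the same minimizer. Storing gaps between sorted document ids rather than the ids themselves keeps most values small, which is what makes the variable-length encoding compact.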

SOFTWARE

The software is implemented in C++17 and available under the MIT license at github.com/yhhshb/kaminari. Reproducibility information and additional results are provided at github.com/vicLeva/benchmarks_kaminari.
