Scalable sequence database search using partitioned aggregated Bloom comb trees.

Affiliation

University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France.

Publication information

Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i252-i259. doi: 10.1093/bioinformatics/btad225.

DOI:10.1093/bioinformatics/btad225
PMID:37387170
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10311332/
Abstract

MOTIVATION

The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making immense public resources searchable is beyond the reach of alignment-based strategies. In recent years, abundant literature has tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures, which combine the ability to query small signatures or variants with scalability to collections of up to 10 000 eukaryotic samples.

RESULTS

Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3- to 6-fold improvement in construction time compared to other compressed methods for comparable index size. In favorable instances, a PAC query can require a single random access and be performed in constant time. Using limited computation resources, we built PAC for very large collections. These include 32 000 human RNA-seq samples, indexed in 5 days, and the entire GenBank bacterial genome collection, indexed in a single day for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed PAC's ability to query 500 000 transcript sequences in less than an hour.

AVAILABILITY AND IMPLEMENTATION

PAC's open-source software is available at https://github.com/Malfoy/PAC.
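
For readers unfamiliar with the approximate membership query idea the abstract relies on, the following is a minimal sketch of k-mer-based querying with one plain Bloom filter per dataset. It is not PAC's partitioned aggregated Bloom comb tree and does not use the PAC software's interface. All names (`BloomFilter`, `build_index`, `query`), the filter size, the number of hash functions, and the 0.8 hit threshold are assumptions chosen for illustration. Canonical k-mers (reverse complements) are ignored for brevity.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all positions from two 64-bit halves of one digest.
        digest = hashlib.blake2b(item.encode(), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


def kmers(seq, k):
    """Return every overlapping substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(datasets, k, num_bits=1 << 16, num_hashes=3):
    """Build one Bloom filter per dataset (hypothetical layout, not PAC's)."""
    index = {}
    for name, sequences in datasets.items():
        bloom = BloomFilter(num_bits, num_hashes)
        for seq in sequences:
            for kmer in kmers(seq, k):
                bloom.add(kmer)
        index[name] = bloom
    return index


def query(index, seq, k, threshold=0.8):
    """Report datasets in which at least `threshold` of the query k-mers are found."""
    query_kmers = kmers(seq, k)
    if not query_kmers:
        return {}
    hits = {}
    for name, bloom in index.items():
        found = sum(kmer in bloom for kmer in query_kmers)
        fraction = found / len(query_kmers)
        if fraction >= threshold:
            hits[name] = fraction
    return hits


datasets = {
    "sample_a": ["ACGTACGTTGCA"],
    "sample_b": ["TTTTTTTTTTTT"],
}
index = build_index(datasets, k=5)
hits = query(index, "ACGTACGTT", k=5)
```

In this sketch a query visits every filter, so its cost grows linearly with the number of datasets. Avoiding that per-dataset cost is the kind of scalability problem that structures such as the one described in the abstract are designed to address.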

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/122c/10311332/16cd84990589/btad225f1.jpg
