Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences.

Authors

Seiler Enrico, Mehringer Svenja, Darvish Mitra, Turc Etienne, Reinert Knut

Affiliations

Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany.

Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany.

Publication information

iScience. 2021 Jun 24;24(7):102782. doi: 10.1016/j.isci.2021.102782. eCollection 2021 Jul 23.

DOI:10.1016/j.isci.2021.102782
PMID:34337360
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8313605/
Abstract

We present Raptor, a system for approximately searching many queries such as next-generation sequencing reads or transcripts in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filters (IBFs) as a set membership data structure and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test and show the performance and limitations of the new features using simulated and real datasets. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara.
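
To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of a minimizer-based pre-filter over an interleaved Bloom filter. It is not Raptor's actual code. All names (`h64`, `minimizers`, `InterleavedBloomFilter`, `prefilter`) and all parameter values (`k`, `w`, `bits_per_bin`, `fraction`) are assumptions made for illustration. The sketch hashes the forward strand only, omits compression and partitioning, and uses a fixed fraction of matching minimizers as a stand-in for the probabilistic threshold, which the abstract does not specify.

```python
import hashlib


def h64(data: bytes, seed: int = 0) -> int:
    """64-bit hash of `data`; `seed` selects an independent hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minimizers(seq: str, k: int, w: int) -> set[int]:
    """Hash of the smallest k-mer in every window of w bases (forward strand only)."""
    hashes = [h64(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)]
    span = w - k + 1  # number of k-mers per window
    if len(hashes) < span:
        return {min(hashes)} if hashes else set()
    return {min(hashes[i:i + span]) for i in range(len(hashes) - span + 1)}


class InterleavedBloomFilter:
    """One Bloom filter per bin, stored row-wise: row p holds bit p of every bin."""

    def __init__(self, bins: int, bits_per_bin: int, num_hashes: int = 2):
        self.bins = bins
        self.rows = [0] * bits_per_bin  # each row is an int used as a bins-bit vector
        self.num_hashes = num_hashes

    def _positions(self, value: int) -> list[int]:
        data = value.to_bytes(8, "little")
        return [h64(data, seed + 1) % len(self.rows) for seed in range(self.num_hashes)]

    def insert(self, value: int, bin_index: int) -> None:
        for p in self._positions(value):
            self.rows[p] |= 1 << bin_index

    def query(self, value: int) -> int:
        """Bit vector of bins that may contain `value` (AND of the selected rows)."""
        result = (1 << self.bins) - 1
        for p in self._positions(value):
            result &= self.rows[p]
        return result

    def count(self, values) -> list[int]:
        """Per-bin number of queried values reported as present."""
        counts = [0] * self.bins
        for v in values:
            hit = self.query(v)
            for b in range(self.bins):
                counts[b] += (hit >> b) & 1
        return counts


def prefilter(ibf: InterleavedBloomFilter, query: str, k: int, w: int,
              fraction: float = 0.8) -> list[int]:
    """Bins in which at least `fraction` of the query's minimizers are found.

    The fixed fraction is an illustrative stand-in, not the paper's threshold.
    """
    mins = minimizers(query, k, w)
    threshold = max(1, int(fraction * len(mins)))
    counts = ibf.count(mins)
    return [b for b, c in enumerate(counts) if c >= threshold]


# Hypothetical toy data: two reference "bins" and a query cut from the first one.
refs = [
    "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGCGGATCCATGCAAGTCTGACCGTAGGCTTAACGGTCAA",
    "TTGACCAGTAGGCATCGATTACCGGTTAACCGTAGCTAGCTTAGGCCAATGCTAGGATCCAGTCAGTCCA",
]
ibf = InterleavedBloomFilter(bins=len(refs), bits_per_bin=4096)
for bin_index, ref in enumerate(refs):
    for m in minimizers(ref, k=8, w=12):
        ibf.insert(m, bin_index)

hits = prefilter(ibf, refs[0][10:50], k=8, w=12)
print(hits)
```

Because the query is an exact substring of the first reference and is at least `w` bases long, every query window is also a reference window, so bin 0 is always reported. Other bins can appear only through shared k-mers or Bloom filter false positives. A read mapper would then align the query only against the reported bins.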
