Seiler Enrico, Mehringer Svenja, Darvish Mitra, Turc Etienne, Reinert Knut
Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany.
Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany.
iScience. 2021 Jun 24;24(7):102782. doi: 10.1016/j.isci.2021.102782. eCollection 2021 Jul 23.
We present Raptor, a system for approximately searching many queries, such as next-generation sequencing reads or transcripts, in large collections of nucleotide sequences. Raptor uses winnowing minimizers to define a set of representative k-mers, an extension of the interleaved Bloom filter (IBF) as a set membership data structure, and probabilistic thresholding for minimizers. Our approach allows compression and partitioning of the IBF to enable the effective use of secondary memory. We test the new features on simulated and real datasets and show their performance and limitations. Our data structure can be used to accelerate various core bioinformatics applications. We show this by re-implementing the distributed read mapping tool DREAM-Yara.
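The interplay of the three components named above can be illustrated with a small Python sketch. It is not Raptor's implementation: the names `h64`, `minimizers`, `InterleavedBloomFilter`, and `search` are hypothetical, BLAKE2b stands in for the tool's hash functions, canonical (strand-independent) k-mers are ignored, and a fixed fraction `min_fraction` replaces the probabilistic threshold, which is not reproduced here. The sketch picks the smallest k-mer hash in every window of `w` consecutive k-mers, stores each reference's minimizers in its own bin of an IBF whose rows hold one bit per bin, and reports the bins in which enough of a query's minimizers are found.

```python
import hashlib
import random


def h64(data: bytes, seed: int = 0) -> int:
    """Seeded 64-bit hash (assumed stand-in for the tool's hash functions)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minimizers(seq: str, k: int, w: int) -> set[int]:
    """Smallest k-mer hash in every window of w consecutive k-mers."""
    hashes = [h64(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)]
    # A sequence with fewer than w k-mers is treated as a single window.
    n_windows = max(len(hashes) - w + 1, 1) if hashes else 0
    return {min(hashes[s:s + w]) for s in range(n_windows)}


class InterleavedBloomFilter:
    """One Bloom filter per bin, interleaved: row r stores bit r of every bin."""

    def __init__(self, num_bins: int, rows: int, num_hashes: int = 2):
        self.num_bins = num_bins
        self.num_hashes = num_hashes
        self.rows = [0] * rows  # each int is a bitmask over all bins

    def _positions(self, value: int) -> list[int]:
        data = value.to_bytes(8, "little")
        return [h64(data, seed + 1) % len(self.rows)
                for seed in range(self.num_hashes)]

    def insert(self, value: int, bin_id: int) -> None:
        for pos in self._positions(value):
            self.rows[pos] |= 1 << bin_id

    def query(self, value: int) -> int:
        """Bitmask of bins that may contain value (false positives possible)."""
        mask = (1 << self.num_bins) - 1
        for pos in self._positions(value):
            mask &= self.rows[pos]
        return mask


def search(ibf, query: str, k: int, w: int, min_fraction: float = 0.7) -> list[int]:
    """Bins in which at least min_fraction of the query's minimizers occur."""
    mins = minimizers(query, k, w)
    counts = [0] * ibf.num_bins
    for m in mins:
        mask = ibf.query(m)
        for b in range(ibf.num_bins):
            counts[b] += (mask >> b) & 1
    needed = min_fraction * len(mins)  # simplified, fixed threshold
    return [b for b, c in enumerate(counts) if mins and c >= needed]


# Toy data: four random references, one bin each.
rng = random.Random(0)
refs = ["".join(rng.choice("ACGT") for _ in range(2000)) for _ in range(4)]
K, W = 19, 5
ibf = InterleavedBloomFilter(num_bins=len(refs), rows=1 << 14)
for bin_id, ref in enumerate(refs):
    for m in minimizers(ref, K, W):
        ibf.insert(m, bin_id)

read = refs[2][500:650]  # error-free read taken from reference 2
hits = search(ibf, read, K, W)
print(hits)
```

Because the read is an exact substring of reference 2, every window of the read is also a window of that reference, so all of its minimizers were inserted into bin 2. A Bloom filter has no false negatives, so bin 2 is always among the reported hits; other bins can appear only through false positives. A single lookup answers the membership question for all bins at once, which is the point of interleaving.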