Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA.
Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA; VMware Research, 3425 Hillview Ave, Palo Alto, CA 94304, USA.
Cell Syst. 2018 Aug 22;7(2):201-207.e4. doi: 10.1016/j.cels.2018.05.021. Epub 2018 Jun 20.
Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, suboptimal space usage, and potentially large numbers of false positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitate large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6-108× faster than SSBT and has no false positives or false negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days.