Almodaresi Fatemeh, Khan Jamshed, Madaminov Sergey, Ferdman Michael, Johnson Rob, Pandey Prashant, Patro Rob
Department of Computer Science, University of Maryland, USA.
Department of Computer Science, Stony Brook University, USA.
Bioinformatics. 2022 Jun 13;38(12):3155-3163. doi: 10.1093/bioinformatics/btac142.
In the past few years, researchers have proposed numerous indexing schemes for searching large datasets of raw sequencing experiments. Most of these proposed indexes are approximate (i.e. with one-sided errors) in order to save space. Recently, researchers have published exact indexes-Mantis, VariMerge and Bifrost-that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data.
In this paper, we show how to build a scalable and updatable exact raw sequence-search index. Specifically, we extend Mantis using the Bentley-Saxe transformation to support efficient updates, called Dynamic Mantis. We demonstrate Dynamic Mantis's scalability by constructing an index of ≈40K samples from SRA by adding samples one at a time to an initial index of 10K samples. Compared to VariMerge and Bifrost, Dynamic Mantis is more efficient in terms of index-construction time and memory, query time and memory and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Dynamic Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support general search queries we require). Dynamic Mantis indexes were about 2.5× smaller than Bifrost's indexes and about half as big as VariMerge's indexes.
Dynamic Mantis implementation is available at https://github.com/splatlab/mantis/tree/mergeMSTs.
Supplementary data are available at Bioinformatics online.
在过去几年中,研究人员针对大型原始测序实验数据集的搜索提出了众多索引方案。为了节省空间,这些提出的索引大多是近似的(即存在单边误差)。最近,研究人员发布了精确索引——Mantis、VariMerge和Bifrost——它们除了可作为k-mer索引外,还能作为彩色德布鲁因图表示。这种新型索引很有前景,因为它有可能支持比简单搜索更复杂的分析。然而,为了能作为大型且不断增长的原始测序数据存储库的索引,它们必须能够扩展到数千个实验,并支持高效插入新数据。
在本文中,我们展示了如何构建一个可扩展且可更新的精确原始序列搜索索引。具体而言,我们使用Bentley-Saxe变换扩展Mantis以支持高效更新,称为动态Mantis。我们通过将样本逐个添加到10K样本的初始索引中,构建了一个来自SRA的约40K样本的索引,以此证明动态Mantis的可扩展性。与VariMerge和Bifrost相比,动态Mantis在索引构建时间和内存、查询时间和内存以及索引大小方面更高效。在我们的基准测试中,VariMerge和Bifrost分别仅能扩展到5K和80个样本,而动态Mantis可扩展到超过39K个样本。Mantis中的查询速度比Bifrost快24倍以上(VariMerge不立即支持我们所需的一般搜索查询)。动态Mantis索引比Bifrost索引小约2.5倍,约为VariMerge索引大小的一半。
动态Mantis的实现可在https://github.com/splatlab/mantis/tree/mergeMSTs获取。
补充数据可在《生物信息学》在线获取。