School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA.
School of Software, Shandong University, Shunhua Road 1500, Jinan, Shandong, China.
BMC Bioinformatics. 2018 Mar 9;19(1):92. doi: 10.1186/s12859-018-2094-5.
Various indexing techniques have been applied by next generation sequencing read mapping tools. The choice of a particular data structure is a trade-off between memory consumption, mapping throughput, and construction time.
We present the succinct hash index - a novel data structure for read mapping which is a variant of the classical q-gram index with a particularly small memory footprint occupying between 3.5 and 5.3 GB for a human reference genome for typical parameter settings. The succinct hash index features two novel seed selection algorithms (group seeding and variable-length seeding) and an efficient parallel construction algorithm, which we have implemented to design the FEM (Fast(F) and Efficient(E) read Mapper(M)) mapper. FEM can return all read mappings within a given edit distance. Our experimental results show that FEM is scalable and outperforms other state-of-the-art all-mappers in terms of both speed and memory footprint. Compared to Masai, FEM is an order-of-magnitude faster using a single thread and two orders-of-magnitude faster when using multiple threads. Furthermore, we observe an up to 2.8-fold speedup compared to BitMapper and an order-of-magnitude speedup compared to BitMapper2 and Hobbes3.
The presented succinct index is the first feasible implementation of the q-gram index functionality that occupies around 3.5 GB of memory for a whole human reference genome. FEM is freely available at https://github.com/haowenz/FEM .
下一代测序读映射工具应用了各种索引技术。特定数据结构的选择是在内存消耗、映射吞吐量和构建时间之间进行权衡。
我们提出了简洁哈希索引 - 一种用于读映射的新型数据结构,它是经典 q-gram 索引的变体,对于典型参数设置,对于人类参考基因组,其内存占用量特别小,介于 3.5 和 5.3GB 之间。简洁哈希索引具有两种新颖的种子选择算法(分组种子选择和可变长度种子选择)和一种高效的并行构建算法,我们已将其实现用于设计 FEM(快速(F)和高效(E)读映射器(M))映射器。FEM 可以在给定编辑距离内返回所有读映射。我们的实验结果表明,FEM 是可扩展的,在速度和内存占用方面均优于其他最先进的全映射器。与 Masai 相比,FEM 使用单个线程时快一个数量级,使用多个线程时快两个数量级。此外,与 BitMapper 相比,我们观察到高达 2.8 倍的速度提升,与 BitMapper2 和 Hobbes3 相比,速度提升一个数量级。
所提出的简洁索引是第一个可行的 q-gram 索引功能实现,对于整个人类参考基因组,它占用约 3.5GB 的内存。FEM 可在 https://github.com/haowenz/FEM 上免费获得。