Suppr超能文献

基于简洁哈希索引的快速高效短读映射。

Fast and efficient short read mapping based on a succinct hash index.

机构信息

School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, 30332, GA, USA.

School of Software, Shandong University, Shunhua Road 1500, Jinan, Shandong, China.

出版信息

BMC Bioinformatics. 2018 Mar 9;19(1):92. doi: 10.1186/s12859-018-2094-5.

Abstract

BACKGROUND

Various indexing techniques have been applied by next generation sequencing read mapping tools. The choice of a particular data structure is a trade-off between memory consumption, mapping throughput, and construction time.

RESULTS

We present the succinct hash index - a novel data structure for read mapping which is a variant of the classical q-gram index with a particularly small memory footprint occupying between 3.5 and 5.3 GB for a human reference genome for typical parameter settings. The succinct hash index features two novel seed selection algorithms (group seeding and variable-length seeding) and an efficient parallel construction algorithm, which we have implemented to design the FEM (Fast(F) and Efficient(E) read Mapper(M)) mapper. FEM can return all read mappings within a given edit distance. Our experimental results show that FEM is scalable and outperforms other state-of-the-art all-mappers in terms of both speed and memory footprint. Compared to Masai, FEM is an order-of-magnitude faster using a single thread and two orders-of-magnitude faster when using multiple threads. Furthermore, we observe an up to 2.8-fold speedup compared to BitMapper and an order-of-magnitude speedup compared to BitMapper2 and Hobbes3.

CONCLUSIONS

The presented succinct index is the first feasible implementation of the q-gram index functionality that occupies around 3.5 GB of memory for a whole human reference genome. FEM is freely available at https://github.com/haowenz/FEM .

摘要

背景

下一代测序读映射工具应用了各种索引技术。特定数据结构的选择是在内存消耗、映射吞吐量和构建时间之间进行权衡。

结果

我们提出了简洁哈希索引 - 一种用于读映射的新型数据结构,它是经典 q-gram 索引的变体,对于典型参数设置,对于人类参考基因组,其内存占用量特别小,介于 3.5 和 5.3GB 之间。简洁哈希索引具有两种新颖的种子选择算法(分组种子选择和可变长度种子选择)和一种高效的并行构建算法,我们已将其实现用于设计 FEM(快速(F)和高效(E)读映射器(M))映射器。FEM 可以在给定编辑距离内返回所有读映射。我们的实验结果表明,FEM 是可扩展的,在速度和内存占用方面均优于其他最先进的全映射器。与 Masai 相比,FEM 使用单个线程时快一个数量级,使用多个线程时快两个数量级。此外,与 BitMapper 相比,我们观察到高达 2.8 倍的速度提升,与 BitMapper2 和 Hobbes3 相比,速度提升一个数量级。

结论

所提出的简洁索引是第一个可行的 q-gram 索引功能实现,对于整个人类参考基因组,它占用约 3.5GB 的内存。FEM 可在 https://github.com/haowenz/FEM 上免费获得。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c591/5845352/ce8ed51692b8/12859_2018_2094_Fig1_HTML.jpg

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验