Depts. of Computer Science and Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
BMC Genomics. 2013;14 Suppl 1(Suppl 1):S13. doi: 10.1186/1471-2164-14-S1-S13. Epub 2013 Jan 21.
With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. The success of all medical and genetic applications of NGS critically depends on computational techniques that can process and analyze this enormous amount of sequence data quickly and accurately. Unfortunately, current read mapping algorithms have difficulty coping with the massive amounts of data generated by NGS. We propose a new algorithm, FastHASH, which drastically improves the performance of seed-and-extend, hash-table-based read mapping algorithms while maintaining the high sensitivity and comprehensiveness of such methods. FastHASH is a generic algorithm compatible with all seed-and-extend read mapping algorithms. It introduces two main techniques: Adjacency Filtering and Cheap K-mer Selection. We implemented FastHASH and merged it into the codebase of the popular read mapping program mrFAST. Depending on the edit distance cutoff, we observed up to a 19-fold speedup while still maintaining 100% sensitivity and high comprehensiveness.
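The abstract names the two techniques without describing them, so the following is only a minimal sketch of how such filters are commonly understood in a seed-and-extend mapper, not the mrFAST implementation. It assumes non-overlapping k-mers, substitution-only errors (indels would shift k-mer offsets and are ignored here), and hypothetical function names `build_index` and `candidate_locations`. Cheap k-mer selection is taken to mean seeding from the k-mers with the fewest reference locations, and adjacency filtering to mean keeping a candidate only if enough of the read's other k-mers occur at the expected neighbouring positions.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read, index, k, max_errors, ref_len):
    """Return read start positions that survive both filters (illustrative only)."""
    # Split the read into non-overlapping k-mers, remembering each offset.
    seeds = [(off, read[off:off + k]) for off in range(0, len(read) - k + 1, k)]
    hits = {off: set(index.get(kmer, ())) for off, kmer in seeds}

    # Cheap k-mer selection (assumed form): seed only from the max_errors + 1
    # k-mers with the fewest reference locations. With at most max_errors
    # substitutions, at least one of these k-mers must match exactly.
    cheapest = sorted(seeds, key=lambda s: len(hits[s[0]]))[:max_errors + 1]

    # Each substitution can destroy at most one non-overlapping k-mer.
    n_required = len(seeds) - max_errors

    candidates = set()
    for off, _ in cheapest:
        for pos in hits[off]:
            start = pos - off
            if start < 0 or start + len(read) > ref_len:
                continue  # the read would hang off the reference
            # Adjacency filtering (assumed form): count the read's k-mers that
            # occur at the position implied by this candidate start.
            matching = sum(1 for o, _ in seeds if start + o in hits[o])
            if matching >= n_required:
                candidates.add(start)
    return sorted(candidates)
```

Only the locations returned by `candidate_locations` would then be passed to the expensive edit-distance verification step. Both filters are cheap set lookups, which is the kind of saving the abstract attributes to FastHASH.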