School of Software, Shandong University, Jinan 250101, China.
Shenzhen Research Institute of Shandong University, Shenzhen 518063, China.
Bioinformatics. 2021 May 5;37(6):873-875. doi: 10.1093/bioinformatics/btaa754.
Mash is a popular hash-based genome analysis toolkit with applications to important downstream analysis tasks such as clustering and assembly. However, Mash currently cannot fully exploit modern multi-core architectures, which leads to long runtimes on large-scale genomic datasets.
We present RabbitMash, an efficient, highly optimized implementation of Mash that takes full advantage of modern hardware through multi-threading, vectorization and fast I/O. We show that our approach achieves speedups of at least 1.3×, 9.8×, 8.5× and 4.4× over Mash for the operations sketch, dist, triangle and screen, respectively. Furthermore, RabbitMash computes the all-versus-all distances of 100,321 genomes in <5 min on a 40-core workstation, while Mash requires over 40 min.
RabbitMash is available at https://github.com/ZekunYin/RabbitMash.
Supplementary data are available at Bioinformatics online.
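For readers unfamiliar with what the sketch, dist and triangle operations compute, the following is a minimal Python illustration of the general MinHash approach that Mash is known for: a bottom-s sketch per sequence, a Jaccard estimate from two sketches, and the Mash distance D = -(1/k) ln(2j / (1 + j)). It is not RabbitMash's or Mash's code. The function names, the use of BLAKE2b as the hash function, the default parameters and the omission of canonical (strand-independent) k-mers are all assumptions made for brevity, and the loop is serial with none of the multi-threading or vectorization described above.

```python
import hashlib
import math
from itertools import combinations


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer.

    Assumption: BLAKE2b stands in for the tool's real hash function,
    and reverse-complement (canonical) k-mers are ignored.
    """
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def sketch(seq, k=21, size=1000):
    """Bottom-s MinHash sketch: the `size` smallest distinct k-mer hashes."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the `size` smallest hashes of the union and count how many
    of them occur in both sketches.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(j, k):
    """Mash distance from a Jaccard estimate j and k-mer length k."""
    if j <= 0.0:
        return 1.0  # no shared hashes: report the maximum distance
    return max(0.0, -math.log(2.0 * j / (1.0 + j)) / k)


def all_pairs(seqs, k=21, size=1000):
    """Serial all-versus-all distances (the lower triangle), keyed by index pair."""
    sketches = [sketch(s, k, size) for s in seqs]
    return {
        (i, j): mash_distance(jaccard_estimate(sketches[i], sketches[j], size), k)
        for i, j in combinations(range(len(seqs)), 2)
    }
```

Each pairwise distance depends only on two precomputed sketches, so the pairs can be evaluated independently of one another. That independence is what makes the all-versus-all computation a natural target for multi-threading.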