Zhang Tong, Yin Zekun, Xu Xiaoming, Yan Lifeng, Zhu Fangjin, Duan Xiaohui, Schmidt Bertil, Liu Weiguo
School of Software, Shandong University, Jinan 250101, China.
Institute for Computer Science, Johannes Gutenberg University, Mainz 55128, Germany.
Bioinformatics. 2025 May 6;41(5). doi: 10.1093/bioinformatics/btaf249.
We present RabbitSketch, a highly optimized library of sketching algorithms, including MinHash, OrderMinHash, and HyperLogLog, that exploits the power of modern multi-core CPUs. It achieves speedups of 2.30× to 49.55× over existing implementations and provides flexible, easy-to-use interfaces for both Python and C++. As a result, the similarity analysis of 455 GB of genomic data can be completed in only 5 minutes using RabbitSketch with merely 20 lines of Python code. As a case study, we enhanced RabbitTClust by integrating RabbitSketch's Kssd algorithm, resulting in a 1.54× speedup with no loss in accuracy.
RabbitSketch is available at https://github.com/RabbitBio/RabbitSketch with an archived version at Zenodo: https://doi.org/10.5281/zenodo.14903962. Detailed API documentation is available at https://rabbitsketch.readthedocs.io/en/latest.
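For readers unfamiliar with sketching, the following is a minimal, single-threaded illustration of the bottom-k MinHash idea named in the abstract, written in plain Python. It does not use the RabbitSketch API and is not the library's implementation. The function names (`kmers`, `hash64`, `bottom_k_sketch`, `jaccard_estimate`), the choice of BLAKE2b as the hash, and the default parameters are all assumptions made for illustration only.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(seq, k=21, size=1000):
    """Keep the `size` smallest distinct k-mer hashes as the sketch."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both inputs.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)
```

In this sketch each sequence is reduced to at most `size` integers, so comparing two genomes costs time proportional to the sketch size rather than to the genome length. That reduction is what makes large-scale similarity analysis tractable; an optimized library would additionally parallelize and vectorize the hashing step.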