RabbitSketch: a high-performance sketching library for genome analysis.

Authors

Zhang Tong, Yin Zekun, Xu Xiaoming, Yan Lifeng, Zhu Fangjin, Duan Xiaohui, Schmidt Bertil, Liu Weiguo

Affiliations

School of Software, Shandong University, Jinan 250101, China.

Institute for Computer Science, Johannes Gutenberg University, Mainz 55128, Germany.

Publication information

Bioinformatics. 2025 May 6;41(5). doi: 10.1093/bioinformatics/btaf249.

DOI:10.1093/bioinformatics/btaf249
PMID:40286290
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12054975/
Abstract

SUMMARY

We present RabbitSketch, a highly optimized library of sketching algorithms such as MinHash, OrderMinHash, and HyperLogLog that can exploit the power of modern multi-core CPUs. It provides significant speedups compared to existing implementations, ranging from 2.30× to 49.55×, as well as flexible and easy-to-use interfaces for both Python and C++. As a result, the similarity analysis of 455GB genomic data can be completed in only 5 minutes using RabbitSketch with merely 20 lines of Python code. As a case study, we enhanced RabbitTClust by integrating RabbitSketch's Kssd algorithm, resulting in a 1.54× speedup with no loss in accuracy.
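
To make the idea of sketch-based similarity analysis concrete, the following is a minimal pure-Python illustration of bottom-k MinHash, one of the algorithm families named in the summary. It is not RabbitSketch's API or implementation: the function names, the choice of BLAKE2b as the hash, and the parameter defaults are all assumptions made for this sketch, and it has none of the multi-core optimizations the library provides.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_k_sketch(seq, k=4, size=64):
    """Return the `size` smallest distinct k-mer hashes, in ascending order."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `size` smallest hashes of the merged sketches and count
    how many of them occur in both inputs.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    # Hypothetical toy sequences: 4 shared 4-mers out of 8 distinct ones.
    a = bottom_k_sketch("ACGTACGTGG")
    b = bottom_k_sketch("ACGTACGTCC")
    print(jaccard_estimate(a, b))
```

For these toy inputs both k-mer sets are smaller than the sketch size, so the estimate coincides with the exact Jaccard index. On real genomes the sketch is far smaller than the k-mer set, and the estimate's error shrinks as the sketch size grows.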

AVAILABILITY AND IMPLEMENTATION

RabbitSketch is available at https://github.com/RabbitBio/RabbitSketch with an archived version at Zenodo: https://doi.org/10.5281/zenodo.14903962. Detailed API documentation is available at https://rabbitsketch.readthedocs.io/en/latest.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e553/12054975/92395eda0d69/btaf249f1.jpg
