GSearch：通过组合 K -mer 哈希和分层可导航小世界图实现超快速和可扩展的基因组搜索。

GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs.

机构信息

Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA.

School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA.

出版信息

Nucleic Acids Res. 2024 Sep 9;52(16):e74. doi: 10.1093/nar/gkae609.

DOI:10.1093/nar/gkae609

PMID:39011878

原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC11381346/

Abstract

Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.

摘要

基因组搜索和/或分类通常涉及找到最佳匹配的数据库（参考）基因组，由于可用数据库基因组的数量不断增加，以及传统方法不适用于大型数据库的事实，这一任务变得越来越具有挑战性。我们通过结合基于 k-mer 哈希的概率数据结构（即 ProbMinHash、SuperMinHash、Densified MinHash 和 SetSketch）来估计基因组距离，并使用基于图的最近邻搜索算法（层次可导航小世界图或 HNSW），创建了一种新的数据结构，并开发了一个相关的计算机程序 GSearch，该程序的速度比替代工具快几个数量级，同时保持高精度和低内存使用。例如，GSearch 可以在个人笔记本电脑上在几分钟内搜索 8000 个查询基因组与所有可用的微生物或病毒基因组的最佳匹配（n = ∼318 000 或 ∼3 000 000，分别），使用约 6GB 的内存（通过 SetSketch 使用 2.5GB）。值得注意的是，GSearch 的时间复杂度为 O(log(N))，并且基于数据库拆分策略，可以很好地扩展到数十亿个基因组。此外，GSearch 根据查询基因组的新颖程度实施了三步搜索策略，以最大化特异性和灵敏度。因此，GSearch 解决了微生物组研究中需要基因组搜索和/或分类的主要瓶颈。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8eb3/11381346/7dbdae08766b/gkae609figgra1.jpg

相似文献

GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs.GSearch：通过组合 K -mer 哈希和分层可导航小世界图实现超快速和可扩展的基因组搜索。

Nucleic Acids Res. 2024 Sep 9;52(16):e74. doi: 10.1093/nar/gkae609.

gSearch: a fast and flexible general search tool for whole-genome sequencing.gSearch：一种快速灵活的全基因组测序通用搜索工具。

Bioinformatics. 2012 Aug 15;28(16):2176-7. doi: 10.1093/bioinformatics/bts358. Epub 2012 Jun 23.

A space and time-efficient index for the compacted colored de Bruijn graph.一种用于压缩彩色 de Bruijn 图的空间和时间高效索引。

Bioinformatics. 2018 Jul 1;34(13):i169-i177. doi: 10.1093/bioinformatics/bty292.

Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2.使用 Dashing 2 进行多重性和位置敏感哈希的基因组草图绘制。

Genome Res. 2023 Jul;33(7):1218-1227. doi: 10.1101/gr.277655.123. Epub 2023 Jul 6.

Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.使用分层可导航小世界图进行高效且鲁棒的近似最近邻搜索

IEEE Trans Pattern Anal Mach Intell. 2020 Apr;42(4):824-836. doi: 10.1109/TPAMI.2018.2889473. Epub 2018 Dec 28.

An Efficient Search Algorithm for Finding Genomic-Range Overlaps Based on the Maximum Range Length.一种基于最大范围长度寻找基因组范围重叠的高效搜索算法。

IEEE/ACM Trans Comput Biol Bioinform. 2015 Jul-Aug;12(4):778-84. doi: 10.1109/TCBB.2014.2369042.

KCOSS: an ultra-fast k-mer counter for assembled genome analysis.KCOSS：用于组装基因组分析的超快速k-mer计数器。

Bioinformatics. 2022 Jan 27;38(4):933-940. doi: 10.1093/bioinformatics/btab797.

Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections.乌贼算法：从大规模基因组集合中快速、并行且低内存消耗的 de Bruijn 图压缩。

Bioinformatics. 2021 Jul 12;37(Suppl_1):i177-i186. doi: 10.1093/bioinformatics/btab309.

Application of kernel functions for accurate similarity search in large chemical databases.核函数在大型化学数据库中精确相似性搜索的应用。

BMC Bioinformatics. 2010 Apr 29;11 Suppl 3(Suppl 3):S8. doi: 10.1186/1471-2105-11-S3-S8.

Fast detection of maximal exact matches via fixed sampling of query K-mers and Bloom filtering of index K-mers.通过查询 K -mer 的固定采样和索引 K-mer 的布隆过滤实现最大精确匹配的快速检测。

Bioinformatics. 2019 Nov 1;35(22):4560-4567. doi: 10.1093/bioinformatics/btz273.

引用本文的文献

EvANI benchmarking workflow for evolutionary distance estimation.用于进化距离估计的EvANI基准测试工作流程。

Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf267.

Promiscuous and genome-wide recombination underlies the sequence-discrete species of the SAR11 lineage in the deep ocean.杂乱且全基因组范围的重组是深海中SAR11谱系的序列离散物种的基础。

ISME J. 2025 Jan 2;19(1). doi: 10.1093/ismejo/wraf072.

EvANI benchmarking workflow for evolutionary distance estimation.用于进化距离估计的EvANI基准测试工作流程。

bioRxiv. 2025 Feb 23:2025.02.23.639716. doi: 10.1101/2025.02.23.639716.

Approximate nearest neighbor graph provides fast and efficient embedding with applications for large-scale biological data.近似最近邻图为大规模生物数据的应用提供了快速有效的嵌入。

NAR Genom Bioinform. 2024 Dec 18;6(4):lqae172. doi: 10.1093/nargab/lqae172. eCollection 2024 Dec.

Aligning distant sequences to graphs using long seed sketches.使用长种子草图对齐图上的远距离序列。

Genome Res. 2023 Jul;33(7):1208-1217. doi: 10.1101/gr.277659.123. Epub 2023 Apr 18.

本文引用的文献

Fast and robust metagenomic sequence comparison through sparse chaining with skani.通过使用 skani 进行稀疏链接实现快速稳健的宏基因组序列比较。

Nat Methods. 2023 Nov;20(11):1661-1665. doi: 10.1038/s41592-023-02018-3. Epub 2023 Sep 21.

Genomic sketching with multiplicities and locality-sensitive hashing using Dashing 2.使用 Dashing 2 进行多重性和位置敏感哈希的基因组草图绘制。

Genome Res. 2023 Jul;33(7):1218-1227. doi: 10.1101/gr.277655.123. Epub 2023 Jul 6.

Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash.使用 FracMinHash 在广泛的进化距离范围内推导突变率的置信区间。

Genome Res. 2023 Jul;33(7):1061-1068. doi: 10.1101/gr.277651.123. Epub 2023 Jun 21.

IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata.IMG/VR v4：一个扩展的未培养病毒基因组数据库，其中包含广泛的功能、分类和生态元数据框架。

Nucleic Acids Res. 2023 Jan 6;51(D1):D733-D743. doi: 10.1093/nar/gkac1037.

GTDB-Tk v2: memory friendly classification with the genome taxonomy database.GTDB-Tk v2：使用基因组分类数据库实现内存友好的分类。

Bioinformatics. 2022 Nov 30;38(23):5315-5316. doi: 10.1093/bioinformatics/btac672.

CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices.CMash：基于 k-mer 的 Jaccard 和包含指数的快速、多分辨率估计。

Bioinformatics. 2022 Jun 24;38(Suppl 1):i28-i35. doi: 10.1093/bioinformatics/btac237.

The OceanDNA MAG catalog contains over 50,000 prokaryotic genomes originated from various marine environments.海洋 DNA MAG 目录包含了超过 50000 个源自各种海洋环境的原核生物基因组。

Sci Data. 2022 Jun 17;9(1):305. doi: 10.1038/s41597-022-01392-5.

FragGeneScanRs: faster gene prediction for short reads.FragGeneScanRs：用于短读长的更快基因预测。

BMC Bioinformatics. 2022 May 28;23(1):198. doi: 10.1186/s12859-022-04736-5.

GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy.GTDB：通过系统发生一致、等级归一化和基于完整基因组的分类学，对细菌和古菌多样性进行持续普查。

Nucleic Acids Res. 2022 Jan 7;50(D1):D785-D794. doi: 10.1093/nar/gkab776.

Interactions between bacterial and phage communities in natural environments.自然环境中细菌和噬菌体群落的相互作用。

Nat Rev Microbiol. 2022 Jan;20(1):49-62. doi: 10.1038/s41579-021-00602-y. Epub 2021 Aug 9.

文献AI研究员

20分钟写一篇综述，助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型，支持多种主流文档格式。

立即体验

GSearch：通过组合 K -mer 哈希和分层可导航小世界图实现超快速和可扩展的基因组搜索。

GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs.

机构信息

出版信息

相似文献

引用本文的文献

本文引用的文献

文献AI研究员

用中文搜PubMed

文档翻译

Suppr 超能文献

相似文献

引用本文的文献

本文引用的文献