Suppr超能文献

GSearch:通过组合 K -mer 哈希和分层可导航小世界图实现超快速和可扩展的基因组搜索。

GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs.

机构信息

Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA.

School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA.

出版信息

Nucleic Acids Res. 2024 Sep 9;52(16):e74. doi: 10.1093/nar/gkae609.

Abstract

Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.

摘要

基因组搜索和/或分类通常涉及找到最佳匹配的数据库(参考)基因组,由于可用数据库基因组的数量不断增加,以及传统方法不适用于大型数据库的事实,这一任务变得越来越具有挑战性。我们通过结合基于 k-mer 哈希的概率数据结构(即 ProbMinHash、SuperMinHash、Densified MinHash 和 SetSketch)来估计基因组距离,并使用基于图的最近邻搜索算法(层次可导航小世界图或 HNSW),创建了一种新的数据结构,并开发了一个相关的计算机程序 GSearch,该程序的速度比替代工具快几个数量级,同时保持高精度和低内存使用。例如,GSearch 可以在个人笔记本电脑上在几分钟内搜索 8000 个查询基因组与所有可用的微生物或病毒基因组的最佳匹配(n = ∼318 000 或 ∼3 000 000,分别),使用约 6GB 的内存(通过 SetSketch 使用 2.5GB)。值得注意的是,GSearch 的时间复杂度为 O(log(N)),并且基于数据库拆分策略,可以很好地扩展到数十亿个基因组。此外,GSearch 根据查询基因组的新颖程度实施了三步搜索策略,以最大化特异性和灵敏度。因此,GSearch 解决了微生物组研究中需要基因组搜索和/或分类的主要瓶颈。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8eb3/11381346/7dbdae08766b/gkae609figgra1.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验