HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors.

Authors

Xu Weihong, Hsu Po-Kai, Moshiri Niema, Yu Shimeng, Rosing Tajana

Affiliations

Department of Computer Science and Engineering, University of California San Diego, CA 92093, USA.

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA.

Publication information

Bioinformatics. 2024 Jul 16;40(7). doi: 10.1093/bioinformatics/btae452.

DOI: 10.1093/bioinformatics/btae452
PMID: 39012512
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11281827/
Abstract

MOTIVATION

Genomic distance estimation is a critical workload since exact computation for whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient solution to estimate ANI similarity by distilling representative k-mers from the original sequences. In this work, we present HyperGen that improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages the emerging hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (Hypervector, HV) in high-dimensional space. HV is compact and can preserve more information, allowing for accurate ANI estimation while reducing required sketch sizes. In particular, the HV sketch representation in HyperGen allows efficient ANI estimation using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables the efficient sketching and ANI estimation for massive genome collections.
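
To make the idea of estimating ANI from hypervector products more concrete, here is a minimal Python sketch. It is not the authors' Rust implementation: the dimension `DIM`, the k-mer size, the BLAKE2-based seeding, and all function names are assumptions chosen for illustration, and any k-mer subsampling step is omitted. Each k-mer is mapped to a pseudo-random bipolar vector, a genome's sketch is the sum of those vectors, and dot products between sketches approximate set sizes and intersections, from which a Jaccard estimate and a Mash-style ANI estimate follow.

```python
import hashlib
import math
import random

DIM = 1024  # hypervector dimension (illustrative assumption)


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_seed(kmer):
    """Deterministic 64-bit seed for a k-mer (hypothetical helper)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def encode(seq, k=11, dim=DIM):
    """Sum one pseudo-random +1/-1 vector per distinct k-mer."""
    hv = [0] * dim
    for kmer in kmers(seq, k):
        bits = random.Random(kmer_seed(kmer)).getrandbits(dim)
        for i in range(dim):
            hv[i] += 1 if (bits >> i) & 1 else -1
    return hv


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def jaccard_estimate(hv_a, hv_b):
    """Random bipolar vectors are quasi-orthogonal, so dot(a, b) is roughly
    dim * |A and B| and dot(a, a) is roughly dim * |A|."""
    ab = dot(hv_a, hv_b)
    union = dot(hv_a, hv_a) + dot(hv_b, hv_b) - ab
    return ab / union if union > 0 else 0.0


def ani_estimate(jaccard, k=11):
    """Mash-style conversion from a Jaccard estimate to ANI."""
    if jaccard <= 0:
        return 0.0
    return max(0.0, 1.0 + math.log(2 * jaccard / (1 + jaccard)) / k)
```

Comparing one query against many references then reduces to a matrix of dot products between sketch vectors, which is the kind of workload the abstract says benefits from GEMM routines. A larger `DIM` reduces the noise in the estimate at the cost of a bigger sketch.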

RESULTS

We evaluate HyperGen's sketching and database search performance using several genome datasets at various scales. HyperGen is able to achieve comparable or superior ANI estimation error and linearity compared to other sketch-based counterparts. The measurement results show that HyperGen is one of the fastest tools for both genome sketching and database search. Meanwhile, HyperGen produces memory-efficient sketch files while ensuring high ANI estimation accuracy.

AVAILABILITY

A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen. The scripts to reproduce the experimental results can be accessed at https://github.com/wh-xu/experiment-hyper-gen.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c70b/11281827/4f52ebe1ac0d/btae452f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c70b/11281827/cc8cc118a3a3/btae452f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c70b/11281827/e4b7806aede5/btae452f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c70b/11281827/fd25519089ce/btae452f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/c70b/11281827/c06cd48638a6/btae452f4.jpg
