Xu Weihong, Hsu Po-Kai, Moshiri Niema, Yu Shimeng, Rosing Tajana
Department of Computer Science and Engineering, University of California San Diego, CA 92093, USA.
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA.
Bioinformatics. 2024 Jul 16;40(7). doi: 10.1093/bioinformatics/btae452.
Genomic distance estimation is a critical workload because exact computation of whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient way to estimate ANI by distilling representative k-mers from the original sequences. In this work, we present HyperGen, which improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages the emerging paradigm of hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (hypervectors, HVs) in high-dimensional space. An HV is compact and can preserve more information, allowing accurate ANI estimation with smaller sketch sizes. In particular, the HV sketch representation in HyperGen allows efficient ANI estimation using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables efficient sketching and ANI estimation for massive genome collections.
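The idea of encoding k-mers into hypervectors and comparing sketches by vector multiplication can be illustrated with a minimal sketch. It is not the HyperGen implementation: the function names, the BLAKE2 hashing, the values of `K` and `DIM`, and the Mash-style conversion from Jaccard index to ANI are all assumptions made for illustration, and k-mer sub-sampling is omitted. Each distinct k-mer hash seeds a pseudo-random +1/-1 vector; summing these vectors gives the genome's HV, and dot products between HVs approximate set sizes and intersection sizes.

```python
import hashlib
import math
import random

K = 21      # k-mer length (assumed for illustration)
DIM = 2048  # hypervector dimension (assumed for illustration)


def kmer_hashes(seq, k=K):
    """Hash every k-mer of seq to a 64-bit integer; return the distinct hashes."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def encode(hashes, dim=DIM):
    """Sum one pseudo-random +1/-1 vector per k-mer hash into a single HV."""
    hv = [0] * dim
    for h in hashes:
        bits = random.Random(h).getrandbits(dim)  # the hash seeds the vector
        for j in range(dim):
            hv[j] += 1 if (bits >> j) & 1 else -1
    return hv


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def estimate_ani(hv_a, hv_b, k=K):
    """Estimate ANI from two HVs (assumed Mash-style Jaccard-to-ANI formula)."""
    dim = len(hv_a)
    # Random +1/-1 vectors are quasi-orthogonal, so dot / dim approximates
    # the number of shared k-mers (or the set size for a self dot product).
    inter = dot(hv_a, hv_b) / dim
    size_a = dot(hv_a, hv_a) / dim
    size_b = dot(hv_b, hv_b) / dim
    union = size_a + size_b - inter
    if inter <= 0 or union <= 0:
        return 0.0  # no detectable overlap
    jaccard = min(inter / union, 1.0)
    ani = 1.0 + math.log(2.0 * jaccard / (1.0 + jaccard)) / k
    return max(ani, 0.0)


if __name__ == "__main__":
    rng = random.Random(1)
    genome = "".join(rng.choice("ACGT") for _ in range(400))
    # Introduce one substitution every 50 bases (about 2% divergence).
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    mutant = "".join(
        swap[c] if i % 50 == 25 else c for i, c in enumerate(genome)
    )
    hv_genome = encode(kmer_hashes(genome))
    hv_mutant = encode(kmer_hashes(mutant))
    print("estimated ANI:", estimate_ani(hv_genome, hv_mutant))
```

Because every comparison reduces to dot products, stacking the HVs of many genomes as rows of a matrix would turn an all-against-all search into a single matrix multiplication, which is where a GEMM routine would apply. The estimate is noisy at this small dimension; larger values of `DIM` reduce the noise at the cost of a larger sketch.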
We evaluate HyperGen's sketching and database search performance using several genome datasets at various scales. HyperGen achieves comparable or superior ANI estimation error and linearity compared to other sketch-based counterparts. The measurement results show that HyperGen is one of the fastest tools for both genome sketching and database search. Meanwhile, HyperGen produces memory-efficient sketch files while ensuring high ANI estimation accuracy.
A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen. The scripts to reproduce the experimental results can be accessed at https://github.com/wh-xu/experiment-hyper-gen.