Nikaein Hassan, Sharifi-Zarchi Ali
Department of Computer Engineering, Sharif University of Technology, Tehran, 1458889694, Iran.
Bioinform Adv. 2025 Jul 23;5(1):vbaf144. doi: 10.1093/bioadv/vbaf144. eCollection 2025.
Locality-Sensitive Hashing (LSH) is a widely used technique for estimating similarity between large datasets in bioinformatics, with applications in genome assembly, sequence alignment, and metagenomics. However, traditional single-metric LSH approaches often lead to inefficiencies, particularly with biological data in which different regions may have diverse evolutionary histories or structural properties. This limitation can reduce accuracy in sequence alignment, variant calling, and functional analysis.
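For readers unfamiliar with single-metric LSH, the following is a minimal sketch of classic MinHash, which estimates the Jaccard similarity between the k-mer sets of two sequences. It is a generic illustration rather than code from this work; the function names, the k-mer length, the number of hash functions, and the use of seeded BLAKE2b as the hash family are all assumptions made for the example.

```python
import hashlib


def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def seeded_hash(kmer, seed):
    """64-bit hash of a k-mer; each seed acts as a different hash function."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(seq, k=8, num_hashes=64):
    """Signature = minimum hash value of the k-mer set under each hash function."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(seeded_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = "ACGTACGTTAGCCGATAGCTAGGCTTACG"
    b = "ACGTACGTTAGCCGATAGCTAGGCTAACG"  # one substitution near the end
    print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The signature compares sequences under one notion of similarity only (shared exact k-mers), which is the single-metric setting the abstract refers to.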
We propose Multi-Metric Locality-Sensitive Hashing (M2LSH), an extension of LSH that integrates multiple similarity metrics for more accurate analysis of complex biological data. By capturing diverse sequence and structural features, M2LSH improves performance in heterogeneous and evolutionarily diverse regions. Building on this, we introduce Multi-Metric MinHash (M3Hash), enhancing sequence alignment and similarity detection. As a proof of concept, we present BisHash, which applies M2LSH to bisulfite sequencing, a key method in DNA methylation analysis. Although not fully optimized, BisHash demonstrates superior accuracy, particularly in challenging scenarios like cancer studies where traditional approaches often fail. Our results highlight the potential of M2LSH and M3Hash to advance bioinformatics research.
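The idea of hashing under more than one similarity metric can be illustrated with a small sketch. This is not the M2LSH, M3Hash, or BisHash algorithm; it is a hypothetical example that assumes each "metric" is a projection of the sequence applied before MinHash. The two projections used here (identity, and a C-to-T conversion mimicking the effect of bisulfite treatment on unmethylated cytosines), the name `multi_metric_similarity`, and all parameter values are assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, k=6, num_hashes=64):
    """Classic MinHash signature over the k-mer set of seq."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{km}".encode(), digest_size=8).digest(),
                "big",
            )
            for km in ks
        ))
    return sig


# Hypothetical "metrics": each is a projection applied before hashing.
PROJECTIONS = {
    "identity": lambda s: s,
    "c_to_t": lambda s: s.replace("C", "T"),  # bisulfite-like conversion
}


def multi_metric_similarity(a, b, k=6, num_hashes=64):
    """Estimate similarity of a and b separately under each projection."""
    scores = {}
    for name, project in PROJECTIONS.items():
        sig_a = minhash_signature(project(a), k, num_hashes)
        sig_b = minhash_signature(project(b), k, num_hashes)
        matches = sum(x == y for x, y in zip(sig_a, sig_b))
        scores[name] = matches / num_hashes
    return scores


if __name__ == "__main__":
    reference = "ACGTTCGACCGTAGCTAGGCTA"
    read = reference.replace("C", "T")  # fully converted, unmethylated read
    print(multi_metric_similarity(reference, read))
```

In this toy case the converted read shares few or no exact k-mers with the reference, so the identity projection scores it as dissimilar, while the C-to-T projection maps both sequences to the same string. How the per-metric scores are combined is left open here because the abstract does not specify it.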
The source code for BisHash and the test procedures for benchmarking aligners using simulated data are publicly accessible at https://github.com/hnikaein/bisHash.