Multi-metric locality sensitive hashing enhances alignment accuracy of bisulfite sequencing reads: BisHash.

Authors

Nikaein Hassan, Sharifi-Zarchi Ali

Affiliation

Department of Computer Engineering, Sharif University of Technology, Tehran, 1458889694, Iran.

Publication information

Bioinform Adv. 2025 Jul 23;5(1):vbaf144. doi: 10.1093/bioadv/vbaf144. eCollection 2025.

Abstract

MOTIVATION

Locality-Sensitive Hashing (LSH) is a widely used algorithm for estimating similarity between large datasets in bioinformatics, with applications in genome assembly, sequence alignment, and metagenomics. However, traditional single-metric LSH approaches often lead to inefficiencies, particularly when handling biological data where regions may have diverse evolutionary histories or structural properties. This limitation can reduce accuracy in sequence alignment, variant calling, and functional analysis.

RESULTS

We propose Multi-Metric Locality-Sensitive Hashing (M2LSH), an extension of LSH that integrates multiple similarity metrics for more accurate analysis of complex biological data. By capturing diverse sequence and structural features, M2LSH improves performance in heterogeneous and evolutionarily diverse regions. Building on this, we introduce Multi-Metric MinHash (M3Hash), enhancing sequence alignment and similarity detection. As a proof of concept, we present BisHash, which applies M2LSH to bisulfite sequencing, a key method in DNA methylation analysis. Although not fully optimized, BisHash demonstrates superior accuracy, particularly in challenging scenarios like cancer studies where traditional approaches often fail. Our results highlight the potential of M2LSH and M3Hash to advance bioinformatics research.
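To make the multi-metric idea concrete, here is a minimal sketch of MinHash signatures computed under two different "views" of a sequence: the raw bases, and a bisulfite-style view in which every C is read as T. This is an illustration under stated assumptions, not the BisHash implementation. The function names (`minhash_signature`, `multi_metric_similarity`, `c_to_t`), the k-mer size, the number of hash functions, and the choice of a full C-to-T conversion as the second metric are all hypothetical and are not taken from the paper.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer, salted with a seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(seq, k=8, num_hashes=64, transform=None):
    """MinHash signature of seq, optionally after applying a transform."""
    if transform is not None:
        seq = transform(seq)
    ks = kmers(seq, k)
    # One minimum per seeded hash function.
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature slots (estimates Jaccard similarity)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def c_to_t(seq):
    """Hypothetical bisulfite-style view: treat every C as T."""
    return seq.replace("C", "T")


def multi_metric_similarity(a, b, k=8, num_hashes=64):
    """Estimate similarity of a and b under each view separately."""
    views = {"exact": None, "c_to_t": c_to_t}
    return {
        name: estimate_similarity(
            minhash_signature(a, k, num_hashes, t),
            minhash_signature(b, k, num_hashes, t),
        )
        for name, t in views.items()
    }


if __name__ == "__main__":
    ref = "ACGTTGCAAGCTAGCTTACGGATCCGATTCAG"
    read = c_to_t(ref)  # simulate a read with every C converted to T
    print(multi_metric_similarity(ref, read))
```

In this toy case the converted read shares no 8-mers with the reference, so the exact view sees almost no similarity, while the C-to-T view yields identical signatures and an estimate of 1.0. Real bisulfite reads convert only unmethylated cytosines and also occur on the reverse strand (appearing as G-to-A), which this sketch does not model.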

AVAILABILITY AND IMPLEMENTATION

The source code for BisHash and the test procedures for benchmarking aligners using simulated data are publicly accessible at https://github.com/hnikaein/bisHash.
