• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

LexicHash:通过字典序比较哈希值进行序列相似性估计。

LexicHash: sequence similarity estimation via lexicographic comparison of hashes.

机构信息

Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States.

出版信息

Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad652.

DOI:10.1093/bioinformatics/btad652
PMID:37878809
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC10628434/
Abstract

MOTIVATION

Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise.

RESULTS

In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how "lexicographically similar" the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision-recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm, and circumventing the seemingly fundamental O(n2) scaling associated with pairwise similarity search.

AVAILABILITY AND IMPLEMENTATION

LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.

摘要

动机

两两序列比对是一项计算量很大的任务,尤其是在第三代测序技术的背景下。这个问题通常通过使用基于哈希的方法(如 MinHash)来近似估计序列相似度来解决。在 MinHash 中,读取中的所有 k-mer 都被哈希化,并且存储最小哈希值,即 min-hash。然后可以通过在许多不同的哈希函数之间计算一对读取之间的最小哈希匹配数来估计成对相似度。参数 k 的选择控制着识别比对任务中的一个重要权衡:较大的 k 值可以更有信心地识别比对(高准确率),但可能导致许多比对缺失(低召回率),尤其是在存在大量噪声的情况下。

结果

在这项工作中,我们引入了 LexicHash,这是一种新的相似度估计方法,它实际上与 k 的选择无关,并实现了大 k 值的高精度和小 k 值 MinHash 的高灵敏度。LexicHash 是 MinHash 的一种变体,具有精心设计的哈希函数。在估计两个读取之间的相似度时,不是简单地检查 min-hashes 是否匹配(如标准 MinHash 中那样),而是检查 LexicHash min-hashes 有多“字典序相似”。在我们对 40 个 PacBio 数据集的实验中,LexicHash 获得的精度-召回曲线下面积平均比 MinHash 提高了 20.9%。此外,LexicHash 框架自然适用于最大比对的高效搜索,生成一个 O(n)时间算法,并避免了与成对相似度搜索相关的看似基本的 O(n2) 缩放。

可用性和实现

LexicHash 可在 GitHub 上获得,网址为 https://github.com/gcgreenberg/LexicHash。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/02680ea007a1/btad652f8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/2c37b4c24451/btad652f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/552d6ed0e70b/btad652f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/dad5787bf037/btad652f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/87c7622e2eb4/btad652f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/c0b00fe31fd6/btad652f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/4d90e46db0b0/btad652f6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/ca3f92c07e38/btad652f7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/02680ea007a1/btad652f8.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/2c37b4c24451/btad652f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/552d6ed0e70b/btad652f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/dad5787bf037/btad652f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/87c7622e2eb4/btad652f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/c0b00fe31fd6/btad652f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/4d90e46db0b0/btad652f6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/ca3f92c07e38/btad652f7.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51c5/10628434/02680ea007a1/btad652f8.jpg

相似文献

1
LexicHash: sequence similarity estimation via lexicographic comparison of hashes.LexicHash:通过字典序比较哈希值进行序列相似性估计。
Bioinformatics. 2023 Nov 1;39(11). doi: 10.1093/bioinformatics/btad652.
2
Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments.光谱杰卡德相似度:一种估计成对序列比对的新方法。
Patterns (N Y). 2020 Jul 31;1(6):100081. doi: 10.1016/j.patter.2020.100081. eCollection 2020 Sep 11.
3
Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis.基于哈希的重排序压缩基因组测序reads:算法与分析。
Bioinformatics. 2018 Feb 15;34(4):558-567. doi: 10.1093/bioinformatics/btx639.
4
Improving the sensitivity of long read overlap detection using grouped short k-mer matches.利用分组短 k-mer 匹配提高长读重叠检测的灵敏度。
BMC Genomics. 2019 Apr 4;20(Suppl 2):190. doi: 10.1186/s12864-019-5475-x.
5
Turtle: identifying frequent k-mers with cache-efficient algorithms.海龟:使用缓存高效算法识别频繁的 k-mer。
Bioinformatics. 2014 Jul 15;30(14):1950-7. doi: 10.1093/bioinformatics/btu132. Epub 2014 Mar 10.
6
Sketching and sampling approaches for fast and accurate long read classification.快速准确的长读分类的草图和采样方法。
BMC Bioinformatics. 2022 Oct 31;23(1):452. doi: 10.1186/s12859-022-05014-0.
7
A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.一种将长读段映射到大型参考数据库的快速近似算法。
J Comput Biol. 2018 Jul;25(7):766-779. doi: 10.1089/cmb.2018.0036. Epub 2018 Apr 30.
8
The Statistics of -mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches.无伪匹配情况下简单突变过程中序列的 -mers 统计。
J Comput Biol. 2022 Feb;29(2):155-168. doi: 10.1089/cmb.2021.0431. Epub 2022 Feb 1.
9
HISEA: HIerarchical SEed Aligner for PacBio data.HISEA:用于PacBio数据的分层种子比对器。
BMC Bioinformatics. 2017 Dec 19;18(1):564. doi: 10.1186/s12859-017-1953-9.
10
CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices.CMash:基于 k-mer 的 Jaccard 和包含指数的快速、多分辨率估计。
Bioinformatics. 2022 Jun 24;38(Suppl 1):i28-i35. doi: 10.1093/bioinformatics/btac237.

引用本文的文献

1
Efficient sequence alignment against millions of prokaryotic genomes with LexicMap.使用LexicMap与数百万个原核生物基因组进行高效序列比对。
Nat Biotechnol. 2025 Sep 10. doi: 10.1038/s41587-025-02812-8.
2
EvANI benchmarking workflow for evolutionary distance estimation.用于进化距离估计的EvANI基准测试工作流程。
Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf267.
3
EvANI benchmarking workflow for evolutionary distance estimation.用于进化距离估计的EvANI基准测试工作流程。

本文引用的文献

1
Entropy predicts sensitivity of pseudorandom seeds.熵预测伪随机种子的敏感性。
Genome Res. 2023 Jul;33(7):1162-1174. doi: 10.1101/gr.277645.123. Epub 2023 May 22.
2
BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis.BLEND:一种在基因组分析中快速、节省内存且准确地查找模糊种子匹配项的机制。
NAR Genom Bioinform. 2023 Jan 20;5(1):lqad004. doi: 10.1093/nargab/lqad004. eCollection 2023 Mar.
3
Parameterized syncmer schemes improve long-read mapping.参数化同步mers 方案提高了长读测序数据的比对效率。
bioRxiv. 2025 Feb 23:2025.02.23.639716. doi: 10.1101/2025.02.23.639716.
4
Sequence similarity estimation by random subsequence sketching.通过随机子序列草图进行序列相似性估计。
bioRxiv. 2025 Feb 8:2025.02.05.636706. doi: 10.1101/2025.02.05.636706.
PLoS Comput Biol. 2022 Oct 28;18(10):e1010638. doi: 10.1371/journal.pcbi.1010638. eCollection 2022 Oct.
4
Theory of local k-mer selection with applications to long-read alignment.基于局部 k-mer 选择的理论及其在长读测序比对中的应用。
Bioinformatics. 2022 Oct 14;38(20):4659-4669. doi: 10.1093/bioinformatics/btab790.
5
Effective sequence similarity detection with strobemers.利用频闪体进行有效的序列相似性检测。
Genome Res. 2021 Nov;31(11):2080-2094. doi: 10.1101/gr.275648.121. Epub 2021 Oct 19.
6
Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer.最小化空间 de Bruijn 图:在个人计算机上数分钟内完成长读段的全基因组组装。
Cell Syst. 2021 Oct 20;12(10):958-968.e6. doi: 10.1016/j.cels.2021.08.009. Epub 2021 Sep 14.
7
Comparison of long-read sequencing technologies in interrogating bacteria and fly genomes.比较长读测序技术在细菌和果蝇基因组分析中的应用。
G3 (Bethesda). 2021 Jun 17;11(6). doi: 10.1093/g3journal/jkab083.
8
Syncmers are more sensitive than minimizers for selecting conserved ‑mers in biological sequences.同步寡聚体在选择生物序列中的保守寡聚体方面比最小寡聚体更敏感。
PeerJ. 2021 Feb 5;9:e10805. doi: 10.7717/peerj.10805. eCollection 2021.
9
Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments.光谱杰卡德相似度:一种估计成对序列比对的新方法。
Patterns (N Y). 2020 Jul 31;1(6):100081. doi: 10.1016/j.patter.2020.100081. eCollection 2020 Sep 11.
10
Improved design and analysis of practical minimizers.实用极小化器的改进设计与分析。
Bioinformatics. 2020 Jul 1;36(Suppl_1):i119-i127. doi: 10.1093/bioinformatics/btaa472.