• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

用于编辑距离的局部敏感桶函数。

Locality-sensitive bucketing functions for the edit distance.

作者信息

Chen Ke, Shao Mingfu

机构信息

Department of Computer Science and Engineering, The Pennsylvania State University, State College, United States.

Huck Institutes of the Life Sciences, The Pennsylvania State University, State College, United States.

出版信息

Algorithms Mol Biol. 2023 Jul 24;18(1):7. doi: 10.1186/s13015-023-00234-2.

DOI:10.1186/s13015-023-00234-2
PMID:37488600
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC10367278/
Abstract

BACKGROUND

Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps.

RESULTS

In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be [Formula: see text]-sensitive if any two sequences within an edit distance of [Formula: see text] are mapped into at least one shared bucket, and any two sequences with distance at least [Formula: see text] are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of [Formula: see text] and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal.

CONCLUSION

These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions.

摘要

背景

许多生物信息学应用涉及对一组序列进行分桶,其中每个序列可以被分配到多个桶中。为了实现高灵敏度和高精度,分桶方法需要将相似的序列分配到同一个桶中,同时将不相似的序列分配到不同的桶中。现有的基于k-mer的分桶方法在处理低错误率的测序数据时效率很高,但在处理高错误率的数据时灵敏度会大大降低。局部敏感哈希(Locality-sensitive hashing,LSH)方案能够通过容忍相似序列中的编辑来缓解这个问题,但目前的先进方法仍然存在很大差距。

结果

在本文中,我们通过允许LSH函数将一个序列哈希到多个桶中,对其进行了推广。形式上,如果编辑距离为[公式:见原文]的任意两个序列被映射到至少一个共享桶中,并且距离至少为[公式:见原文]的任意两个序列被映射到不相交的桶子集,则将一个将固定长度的序列映射到桶子集的分桶函数定义为[公式:见原文]敏感的。我们构造了具有各种[公式:见原文]值的局部敏感分桶(Locality-sensitive bucketing,LSB)函数,并分析了它们相对于所需桶的总数以及特定序列被映射到的桶数的效率。我们还证明了这两个参数在不同设置下的下界,并表明我们构造的一些LSB函数是最优的。

结论

这些结果为它们在分析高错误率序列中的实际应用奠定了理论基础,同时也为设计无间隙LSH函数的难度提供了见解。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7c62/10367278/9f2e73358f9e/13015_2023_234_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7c62/10367278/09ef829f2241/13015_2023_234_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7c62/10367278/9f2e73358f9e/13015_2023_234_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7c62/10367278/09ef829f2241/13015_2023_234_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7c62/10367278/9f2e73358f9e/13015_2023_234_Fig2_HTML.jpg

相似文献

1
Locality-sensitive bucketing functions for the edit distance.用于编辑距离的局部敏感桶函数。
Algorithms Mol Biol. 2023 Jul 24;18(1):7. doi: 10.1186/s13015-023-00234-2.
2
Learning locality-sensitive bucketing functions.学习位置敏感的分桶函数。
Bioinformatics. 2024 Jun 28;40(Suppl 1):i318-i327. doi: 10.1093/bioinformatics/btae228.
3
Locality-sensitive hashing for the edit distance.基于编辑距离的位置敏感哈希
Bioinformatics. 2019 Jul 15;35(14):i127-i135. doi: 10.1093/bioinformatics/btz354.
4
Fast Comparative Analysis of Merge Trees Using Locality Sensitive Hashing.使用局部敏感哈希的合并树快速比较分析
IEEE Trans Vis Comput Graph. 2025 Jan;31(1):141-151. doi: 10.1109/TVCG.2024.3456383. Epub 2024 Nov 25.
5
Robust hashing with local models for approximate similarity search.基于局部模型的鲁棒哈希用于近似相似度搜索。
IEEE Trans Cybern. 2014 Jul;44(7):1225-36. doi: 10.1109/TCYB.2013.2289351.
6
Deep Constrained Siamese Hash Coding Network and Load-Balanced Locality-Sensitive Hashing for Near Duplicate Image Detection.深度约束型孪生哈希编码网络与负载均衡型局部敏感哈希在近似重复图像检测中的应用。
IEEE Trans Image Process. 2018 Sep;27(9):4452-4464. doi: 10.1109/TIP.2018.2839886.
7
Shared Nearest Neighbor Clustering in a Locality Sensitive Hashing Framework.局部敏感哈希框架下的共享最近邻聚类
J Comput Biol. 2018 Feb;25(2):236-250. doi: 10.1089/cmb.2017.0113. Epub 2017 Sep 27.
8
Secure discovery of genetic relatives across large-scale and distributed genomic data sets.在大规模和分布式基因组数据集上安全发现遗传亲属。
Genome Res. 2024 Oct 11;34(9):1312-1323. doi: 10.1101/gr.279057.124.
9
On the Maximal Independent Sets of -mers with the Edit Distance.关于具有编辑距离的 - 聚体的最大独立集
ACM BCB. 2023 Sep;2023. doi: 10.1145/3584371.3612982. Epub 2023 Oct 4.
10
Stratified locality-sensitive hashing for accelerated physiological time series retrieval.用于加速生理时间序列检索的分层局部敏感哈希
Annu Int Conf IEEE Eng Med Biol Soc. 2016 Aug;2016:2479-2483. doi: 10.1109/EMBC.2016.7591233.

引用本文的文献

1
Construction of edit-distance graphs for large sets of short reads through minimizer-bucketing.通过最小化器分桶为大量短读段构建编辑距离图。
Bioinform Adv. 2025 Apr 10;5(1):vbaf081. doi: 10.1093/bioadv/vbaf081. eCollection 2025.
2
Learning locality-sensitive bucketing functions.学习位置敏感的分桶函数。
Bioinformatics. 2024 Jun 28;40(Suppl 1):i318-i327. doi: 10.1093/bioinformatics/btae228.

本文引用的文献

1
Effective sequence similarity detection with strobemers.利用频闪体进行有效的序列相似性检测。
Genome Res. 2021 Nov;31(11):2080-2094. doi: 10.1101/gr.275648.121. Epub 2021 Oct 19.
2
Overlap detection on long, error-prone sequencing reads via smooth q-gram.通过平滑 q-gram 检测长、易错测序读长中的重叠
Bioinformatics. 2020 Dec 8;36(19):4838-4845. doi: 10.1093/bioinformatics/btaa252.
3
Locality-sensitive hashing for the edit distance.基于编辑距离的位置敏感哈希
Bioinformatics. 2019 Jul 15;35(14):i127-i135. doi: 10.1093/bioinformatics/btz354.
4
Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon.从 Iso-Seq 数据中破译高度相似的多基因家族转录本,使用 IsoCon。
Nat Commun. 2018 Nov 2;9(1):4601. doi: 10.1038/s41467-018-06910-x.
5
Asymptotically optimal minimizers schemes.渐近最优极小化方案。
Bioinformatics. 2018 Jul 1;34(13):i13-i22. doi: 10.1093/bioinformatics/bty258.
6
Minimap2: pairwise alignment for nucleotide sequences.Minimap2:核苷酸序列的两两比对。
Bioinformatics. 2018 Sep 15;34(18):3094-3100. doi: 10.1093/bioinformatics/bty191.
7
Nanopore sequencing and assembly of a human genome with ultra-long reads.纳米孔测序和超长读长组装人类基因组。
Nat Biotechnol. 2018 Apr;36(4):338-345. doi: 10.1038/nbt.4060. Epub 2018 Jan 29.
8
Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing.设计小型通用k-mer命中集以改进对高通量测序的分析
PLoS Comput Biol. 2017 Oct 2;13(10):e1005777. doi: 10.1371/journal.pcbi.1005777. eCollection 2017 Oct.
9
A comprehensive review and comparison of different computational methods for protein remote homology detection.蛋白质远程同源检测不同计算方法的综合回顾与比较。
Brief Bioinform. 2018 Mar 1;19(2):231-244. doi: 10.1093/bib/bbw108.
10
PacBio Sequencing and Its Applications.PacBio测序技术及其应用。
Genomics Proteomics Bioinformatics. 2015 Oct;13(5):278-89. doi: 10.1016/j.gpb.2015.08.002. Epub 2015 Nov 2.