• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

使用 Damerau-Levenshtein 距离进行字符串校正。

String correction using the Damerau-Levenshtein distance.

机构信息

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, 32611, FL, USA.

出版信息

BMC Bioinformatics. 2019 Jun 6;20(Suppl 11):277. doi: 10.1186/s12859-019-2819-0.

DOI:10.1186/s12859-019-2819-0
PMID:31167641
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC6551241/
Abstract

BACKGROUND

In the string correction problem, we are to transform one string into another using a set of prescribed edit operations. In string correction using the Damerau-Levenshtein (DL) distance, the permissible edit operations are: substitution, insertion, deletion and transposition. Several algorithms for string correction using the DL distance have been proposed. The fastest and most space efficient of these algorithms is due to Lowrance and Wagner. It computes the DL distance between strings of length m and n, respectively, in O(mn) time and O(mn) space. In this paper, we focus on the development of algorithms whose asymptotic space complexity is less and whose actual runtime and energy consumption are less than those of the algorithm of Lowrance and Wagner.

RESULTS

We develop space- and cache-efficient algorithms to compute the Damerau-Levenshtein (DL) distance between two strings as well as to find a sequence of edit operations of length equal to the DL distance. Our algorithms require O(s min{m,n}+m+n) space, where s is the size of the alphabet and m and n are, respectively, the lengths of the two strings. Previously known algorithms require O(mn) space. The space- and cache-efficient algorithms of this paper are demonstrated, experimentally, to be superior to earlier algorithms for the DL distance problem on time, space, and enery metrics using three different computational platforms.

CONCLUSION

Our benchmarking shows that, our algorithms are able to handle much larger sequences than earlier algorithms due to the reduction in space requirements. On a single core, we are able to compute the DL distance and an optimal edit sequence faster than known algorithms by as much as 73.1% and 63.5%, respectively. Further, we reduce energy consumption by as much as 68.5%. Multicore versions of our algorithms achieve a speedup of 23.2 on 24 cores.

摘要

背景

在字符串纠错问题中,我们需要使用一组规定的编辑操作将一个字符串转换为另一个字符串。在使用 Damerau-Levenshtein(DL)距离的字符串纠错中,允许的编辑操作有:替换、插入、删除和转置。已经提出了几种使用 DL 距离的字符串纠错算法。其中最快和最节省空间的算法是 Lowrance 和 Wagner 提出的算法。该算法分别在 O(mn)时间和 O(mn)空间内计算两个长度分别为 m 和 n 的字符串之间的 DL 距离。在本文中,我们专注于开发其渐近空间复杂度更小、实际运行时和能耗都小于 Lowrance 和 Wagner 算法的算法。

结果

我们开发了空间和缓存高效的算法来计算两个字符串之间的 Damerau-Levenshtein(DL)距离以及找到长度等于 DL 距离的编辑操作序列。我们的算法需要 O(s min{m,n}+m+n)空间,其中 s 是字母表的大小,m 和 n 分别是两个字符串的长度。之前已知的算法需要 O(mn)空间。本文的空间和缓存高效算法在三个不同的计算平台上,在时间、空间和能量指标方面,都优于之前的 DL 距离问题算法。

结论

我们的基准测试表明,由于空间需求的减少,我们的算法能够处理比之前算法大得多的序列。在单核上,我们能够比已知算法更快地计算 DL 距离和最优编辑序列,分别快 73.1%和 63.5%。此外,我们还降低了 68.5%的能耗。我们的算法的多核版本在 24 核上实现了 23.2 的加速比。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/4c8e3eba2029/12859_2019_2819_Fig21_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/f5478c3953be/12859_2019_2819_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/3fe24e1f2ce0/12859_2019_2819_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/cbb3b266acf4/12859_2019_2819_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/9d6fab3f364a/12859_2019_2819_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/9adc3612d335/12859_2019_2819_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/9fb582f74fe9/12859_2019_2819_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/73c56d2b535c/12859_2019_2819_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/8fb486f3b266/12859_2019_2819_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/8e27a8bc699f/12859_2019_2819_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/d0107641470e/12859_2019_2819_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/5b9b98e6292c/12859_2019_2819_Fig11_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/13f09c720816/12859_2019_2819_Fig12_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/7838c492aace/12859_2019_2819_Fig13_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/af2a0d7ff7ce/12859_2019_2819_Fig14_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/9cd2ac040f7f/12859_2019_2819_Fig15_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/64b8919f9fe4/12859_2019_2819_Fig16_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/d9cca096061c/12859_2019_2819_Fig17_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/d3865bd60cec/12859_2019_2819_Fig18_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/ba16e443a973/12859_2019_2819_Fig19_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/3ffe9ea71a9b/12859_2019_2819_Fig20_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/4c8e3eba2029/12859_2019_2819_Fig21_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/f5478c3953be/12859_2019_2819_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/3fe24e1f2ce0/12859_2019_2819_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/cbb3b266acf4/12859_2019_2819_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/9d6fab3f364a/12859_2019_2819_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/9adc3612d335/12859_2019_2819_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/9fb582f74fe9/12859_2019_2819_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/73c56d2b535c/12859_2019_2819_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/8fb486f3b266/12859_2019_2819_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/8e27a8bc699f/12859_2019_2819_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/d0107641470e/12859_2019_2819_Fig10_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/5b9b98e6292c/12859_2019_2819_Fig11_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/13f09c720816/12859_2019_2819_Fig12_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/7838c492aace/12859_2019_2819_Fig13_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/af2a0d7ff7ce/12859_2019_2819_Fig14_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/9cd2ac040f7f/12859_2019_2819_Fig15_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/64b8919f9fe4/12859_2019_2819_Fig16_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/d9cca096061c/12859_2019_2819_Fig17_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/d3865bd60cec/12859_2019_2819_Fig18_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/ba16e443a973/12859_2019_2819_Fig19_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/3ffe9ea71a9b/12859_2019_2819_Fig20_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0aff/6551241/4c8e3eba2029/12859_2019_2819_Fig21_HTML.jpg

相似文献

1
String correction using the Damerau-Levenshtein distance.使用 Damerau-Levenshtein 距离进行字符串校正。
BMC Bioinformatics. 2019 Jun 6;20(Suppl 11):277. doi: 10.1186/s12859-019-2819-0.
2
Linear space string correction algorithm using the Damerau-Levenshtein distance.基于 Damerau-Levenshtein 距离的线性空间字符串校正算法。
BMC Bioinformatics. 2020 Dec 9;21(Suppl 1):4. doi: 10.1186/s12859-019-3184-8.
3
Improved algorithms for approximate string matching (extended abstract).用于近似字符串匹配的改进算法(扩展摘要)。
BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S10. doi: 10.1186/1471-2105-10-S1-S10.
4
Efficient sequential and parallel algorithms for finding edit distance based motifs.用于查找基于编辑距离的基序的高效顺序和并行算法。
BMC Genomics. 2016 Aug 18;17 Suppl 4(Suppl 4):465. doi: 10.1186/s12864-016-2789-9.
5
Where No Universal Health Care Identifier Exists: Comparison and Determination of the Utility of Score-Based Persons Matching Algorithms Using Demographic Data.在不存在通用医疗保健标识符的情况下:使用人口统计学数据对基于分数的人员匹配算法的效用进行比较和判定。
JMIR Public Health Surveill. 2018 Dec 13;4(4):e10436. doi: 10.2196/10436.
6
An efficient rank based approach for closest string and closest substring.一种基于排序的高效字符串和子串最近距离方法。
PLoS One. 2012;7(6):e37576. doi: 10.1371/journal.pone.0037576. Epub 2012 Jun 4.
7
Secure approximation of edit distance on genomic data.基因组数据编辑距离的安全近似值。
BMC Med Genomics. 2017 Jul 26;10(Suppl 2):41. doi: 10.1186/s12920-017-0279-9.
8
libFLASM: a software library for fixed-length approximate string matching.libFLASM:一个用于固定长度近似字符串匹配的软件库。
BMC Bioinformatics. 2016 Nov 10;17(1):454. doi: 10.1186/s12859-016-1320-2.
9
Fast exact algorithms for the closest string and substring problems with application to the planted (L, d)-motif model.快速精确算法求解最接近字符串和子字符串问题及其在 (L, d)-基序模型中的应用。
IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1400-10. doi: 10.1109/TCBB.2011.21.
10
Interpreting Sequence-Levenshtein distance for determining error type and frequency between two embedded sequences of equal length.解释序列莱文斯坦距离以确定两个等长嵌入序列之间的错误类型和频率。
ArXiv. 2023 Oct 19:arXiv:2310.12833v1.

引用本文的文献

1
Leveraging social media data to study disease and treatment characteristics of Hodgkin's lymphoma Using Natural Language Processing methods.利用社交媒体数据,采用自然语言处理方法研究霍奇金淋巴瘤的疾病及治疗特征。
PLOS Digit Health. 2025 Mar 19;4(3):e0000765. doi: 10.1371/journal.pdig.0000765. eCollection 2025 Mar.
2
Molecular Mechanisms of Immunoinflammatory Infiltration and Ferroptosis in Arthritis Revealed by a Combination of Bioinformatics and Single-Cell Analysis.生物信息学与单细胞分析相结合揭示关节炎中免疫炎症浸润和铁死亡的分子机制
J Inflamm Res. 2025 Feb 18;18:2409-2432. doi: 10.2147/JIR.S503618. eCollection 2025.
3

本文引用的文献

1
The RNase H-like superfamily: new members, comparative structural analysis and evolutionary classification.RNase H 样超家族:新成员、比较结构分析和进化分类。
Nucleic Acids Res. 2014 Apr;42(7):4160-79. doi: 10.1093/nar/gkt1414. Epub 2014 Jan 23.
2
Pairwise Sequence Alignment for Very Long Sequences on GPUs.GPU 上超长序列的成对序列比对
IEEE Int Conf Comput Adv Bio Med Sci. 2012. doi: 10.1109/ICCABS.2012.6182641.
3
Identification of common molecular subsequences.常见分子子序列的鉴定
Single-Cell Analysis Integrated With Machine Learning Elucidates the Mechanisms of Nucleus Pulposus Cells Apoptosis in Intervertebral Disc Degeneration and Therapeutic Interventions.
结合机器学习的单细胞分析阐明椎间盘退变中髓核细胞凋亡机制及治疗干预措施
JOR Spine. 2025 Jan 20;8(1):e70036. doi: 10.1002/jsp2.70036. eCollection 2025 Mar.
4
Protein features fusion using attributed network embedding for predicting protein-protein interaction.使用属性网络嵌入进行蛋白质特征融合,以预测蛋白质-蛋白质相互作用。
BMC Genomics. 2024 May 13;25(1):466. doi: 10.1186/s12864-024-10361-8.
5
Cross-domain information fusion and personalized recommendation in artificial intelligence recommendation system based on mathematical matrix decomposition.基于数学矩阵分解的人工智能推荐系统中的跨域信息融合与个性化推荐
Sci Rep. 2024 Apr 3;14(1):7816. doi: 10.1038/s41598-024-57240-6.
6
On the semantic representation of risk.论风险的语义表征
Sci Adv. 2022 Jul 8;8(27):eabm1883. doi: 10.1126/sciadv.abm1883.
7
Glioma Subtypes Based on the Activity Changes of Immunologic and Hallmark Gene Sets in Cancer.基于癌症中免疫和特征基因集活性变化的胶质瘤亚型
Front Endocrinol (Lausanne). 2022 Jun 13;13:879233. doi: 10.3389/fendo.2022.879233. eCollection 2022.
8
Integrative bioinformatics and experimental analysis revealed TEAD as novel prognostic target for hepatocellular carcinoma and its roles in ferroptosis regulation.整合生物信息学和实验分析揭示了 TEAD 是肝细胞癌的新型预后靶点及其在铁死亡调控中的作用。
Aging (Albany NY). 2022 Jan 25;14(2):961-974. doi: 10.18632/aging.203853.
9
Identification of a five-miRNA signature as a novel potential prognostic biomarker in patients with nasopharyngeal carcinoma.鉴定五个 miRNA 特征作为鼻咽癌患者新型潜在预后生物标志物。
Hereditas. 2022 Jan 8;159(1):3. doi: 10.1186/s41065-021-00214-9.
10
Internal state effects on behavioral shifts in freely behaving praying mantises (Tenodera sinensis).内部状态对自由活动的螳螂(中华大刀螳)行为转变的影响。
PLoS Comput Biol. 2021 Dec 20;17(12):e1009618. doi: 10.1371/journal.pcbi.1009618. eCollection 2021 Dec.
J Mol Biol. 1981 Mar 25;147(1):195-7. doi: 10.1016/0022-2836(81)90087-5.
4
A general method applicable to the search for similarities in the amino acid sequence of two proteins.一种适用于寻找两种蛋白质氨基酸序列相似性的通用方法。
J Mol Biol. 1970 Mar;48(3):443-53. doi: 10.1016/0022-2836(70)90057-4.
5
Optimal alignments in linear space.线性空间中的最优比对
Comput Appl Biosci. 1988 Mar;4(1):11-7. doi: 10.1093/bioinformatics/4.1.11.