• 文献检索
  • 文档翻译
  • 深度研究
  • 学术资讯
  • Suppr Zotero 插件Zotero 插件
  • 邀请有礼
  • 套餐&价格
  • 历史记录
应用&插件
Suppr Zotero 插件Zotero 插件浏览器插件Mac 客户端Windows 客户端微信小程序
定价
高级版会员购买积分包购买API积分包
服务
文献检索文档翻译深度研究API 文档MCP 服务
关于我们
关于 Suppr公司介绍联系我们用户协议隐私条款
关注我们

Suppr 超能文献

核心技术专利:CN118964589B侵权必究
粤ICP备2023148730 号-1Suppr @ 2026

文献检索

告别复杂PubMed语法,用中文像聊天一样搜索,搜遍4000万医学文献。AI智能推荐,让科研检索更轻松。

立即免费搜索

文件翻译

保留排版,准确专业,支持PDF/Word/PPT等文件格式,支持 12+语言互译。

免费翻译文档

深度研究

AI帮你快速写综述,25分钟生成高质量综述,智能提取关键信息,辅助科研写作。

立即免费体验

基于 Damerau-Levenshtein 距离的线性空间字符串校正算法。

Linear space string correction algorithm using the Damerau-Levenshtein distance.

机构信息

Department of Computer and Information Science and Engineering, University of Florida, Gainesville, 32611, FL, USA.

出版信息

BMC Bioinformatics. 2020 Dec 9;21(Suppl 1):4. doi: 10.1186/s12859-019-3184-8.

DOI:10.1186/s12859-019-3184-8
PMID:33297940
原文链接:https://pmc.ncbi.nlm.nih.gov/articles/PMC7724867/
Abstract

BACKGROUND

The Damerau-Levenshtein (DL) distance metric has been widely used in the biological science. It tries to identify the similar region of DNA,RNA and protein sequences by transforming one sequence to the another using the substitution, insertion, deletion and transposition operations. Lowrance and Wagner have developed an O(mn) time O(mn) space algorithm to find the minimum cost edit sequence between strings of length m and n, respectively. In our previous research, we have developed algorithms that run in O(mn) time using only O(s∗min{m,n}+m+n) space, where s is the size of the alphabet comprising the strings, to compute the DL distance as well as the corresponding edit sequence. These are so far the fastest and most space efficient algorithms. In this paper, we focus on the development of algorithms whose asymptotic space complexity is linear.

RESULTS

We develop linear space algorithms to compute the Damerau-Levenshtein (DL) distance between two strings and determine the optimal trace (corresponding edit operations.)Extensive experiments conducted on three computational platforms-Xeon E5 2603, I7-x980 and Xeon E5 2695-show that, our algorithms, in addition to using less space, are much faster than earlier algorithms.

CONCLUSION

Besides using less space than the previously known algorithms,significant run-time improvement was seen for our new algorithms on all three of our experimental platforms. On all platforms, our linear-space cache-efficient algorithms reduced run time by as much as 56.4% and 57.4% in respect to compute the DL distance and an optimal edit sequences compared to previous algorithms. Our multi-core algorithms reduced the run time by up to 59.3% compared to the best previously known multi-core algorithms.

摘要

背景

Damerau-Levenshtein(DL)距离度量已广泛应用于生物科学领域。它试图通过使用替换、插入、删除和转位操作将一个序列转换为另一个序列,从而识别 DNA、RNA 和蛋白质序列的相似区域。Lowrance 和 Wagner 开发了一种 O(mn)时间 O(mn)空间算法,分别用于找到长度为 m 和 n 的两个字符串之间的最小代价编辑序列。在我们之前的研究中,我们开发了一种在 O(mn)时间内运行的算法,仅使用 O(s∗min{m,n}+m+n)空间,其中 s 是字符串所包含的字母表的大小,来计算 DL 距离以及相应的编辑序列。到目前为止,这些算法是最快和最节省空间的算法。在本文中,我们专注于开发其渐近空间复杂度为线性的算法。

结果

我们开发了线性空间算法来计算两个字符串之间的 Damerau-Levenshtein(DL)距离,并确定最优跟踪(对应编辑操作)。在三个计算平台上——Xeon E5 2603、I7-x980 和 Xeon E5 2695——进行了广泛的实验,结果表明,我们的算法除了使用更少的空间外,在速度上也比早期算法快得多。

结论

除了使用的空间比以前已知的算法少之外,我们的新算法在我们的所有三个实验平台上都显著提高了运行时间。在所有平台上,与之前的算法相比,我们的线性空间缓存高效算法在计算 DL 距离和最优编辑序列时将运行时间分别缩短了 56.4%和 57.4%。与之前已知的最佳多核算法相比,我们的多核算法将运行时间缩短了高达 59.3%。

相似文献

1
Linear space string correction algorithm using the Damerau-Levenshtein distance.基于 Damerau-Levenshtein 距离的线性空间字符串校正算法。
BMC Bioinformatics. 2020 Dec 9;21(Suppl 1):4. doi: 10.1186/s12859-019-3184-8.
2
String correction using the Damerau-Levenshtein distance.使用 Damerau-Levenshtein 距离进行字符串校正。
BMC Bioinformatics. 2019 Jun 6;20(Suppl 11):277. doi: 10.1186/s12859-019-2819-0.
3
Improved algorithms for approximate string matching (extended abstract).用于近似字符串匹配的改进算法(扩展摘要)。
BMC Bioinformatics. 2009 Jan 30;10 Suppl 1(Suppl 1):S10. doi: 10.1186/1471-2105-10-S1-S10.
4
Efficient sequential and parallel algorithms for finding edit distance based motifs.用于查找基于编辑距离的基序的高效顺序和并行算法。
BMC Genomics. 2016 Aug 18;17 Suppl 4(Suppl 4):465. doi: 10.1186/s12864-016-2789-9.
5
Where No Universal Health Care Identifier Exists: Comparison and Determination of the Utility of Score-Based Persons Matching Algorithms Using Demographic Data.在不存在通用医疗保健标识符的情况下:使用人口统计学数据对基于分数的人员匹配算法的效用进行比较和判定。
JMIR Public Health Surveill. 2018 Dec 13;4(4):e10436. doi: 10.2196/10436.
6
Efficient edit distance with duplications and contractions.带重复和收缩的高效编辑距离
Algorithms Mol Biol. 2013 Oct 29;8(1):27. doi: 10.1186/1748-7188-8-27.
7
A normalized Levenshtein distance metric.一种归一化的莱文斯坦距离度量。
IEEE Trans Pattern Anal Mach Intell. 2007 Jun;29(6):1091-5. doi: 10.1109/TPAMI.2007.1078.
8
Cache and energy efficient algorithms for Nussinov's RNA Folding.用于努西诺夫RNA折叠的缓存与节能算法。
BMC Bioinformatics. 2017 Dec 6;18(Suppl 15):518. doi: 10.1186/s12859-017-1917-0.
9
A Simple Linear Space Algorithm for Computing Nonoverlapping Inversion and Transposition Distance in Quadratic Average Time.
J Comput Biol. 2018 Jun;25(6):563-575. doi: 10.1089/cmb.2017.0257. Epub 2018 Apr 16.
10
An efficient rank based approach for closest string and closest substring.一种基于排序的高效字符串和子串最近距离方法。
PLoS One. 2012;7(6):e37576. doi: 10.1371/journal.pone.0037576. Epub 2012 Jun 4.

本文引用的文献

1
String correction using the Damerau-Levenshtein distance.使用 Damerau-Levenshtein 距离进行字符串校正。
BMC Bioinformatics. 2019 Jun 6;20(Suppl 11):277. doi: 10.1186/s12859-019-2819-0.
2
bcRep: R Package for Comprehensive Analysis of B Cell Receptor Repertoire Data.bcRep:用于B细胞受体库数据综合分析的R包。
PLoS One. 2016 Aug 23;11(8):e0161569. doi: 10.1371/journal.pone.0161569. eCollection 2016.
3
PR2S2Clust: Patched RNA-seq read segments' structure-oriented clustering.PR2S2Clust:基于patched RNA测序读段的面向结构的聚类
J Bioinform Comput Biol. 2016 Oct;14(5):1650027. doi: 10.1142/S021972001650027X. Epub 2016 Jul 25.
4
The RNase H-like superfamily: new members, comparative structural analysis and evolutionary classification.RNase H 样超家族:新成员、比较结构分析和进化分类。
Nucleic Acids Res. 2014 Apr;42(7):4160-79. doi: 10.1093/nar/gkt1414. Epub 2014 Jan 23.