Department of Computer and Information Science and Engineering, University of Florida, Gainesville, 32611, FL, USA.
BMC Bioinformatics. 2019 Jun 6;20(Suppl 11):277. doi: 10.1186/s12859-019-2819-0.
In the string correction problem, we are to transform one string into another using a set of prescribed edit operations. In string correction using the Damerau-Levenshtein (DL) distance, the permissible edit operations are: substitution, insertion, deletion and transposition. Several algorithms for string correction using the DL distance have been proposed. The fastest and most space efficient of these algorithms is due to Lowrance and Wagner. It computes the DL distance between strings of length m and n, respectively, in O(mn) time and O(mn) space. In this paper, we focus on the development of algorithms whose asymptotic space complexity is less and whose actual runtime and energy consumption are less than those of the algorithm of Lowrance and Wagner.
We develop space- and cache-efficient algorithms to compute the Damerau-Levenshtein (DL) distance between two strings as well as to find a sequence of edit operations of length equal to the DL distance. Our algorithms require O(s min{m,n}+m+n) space, where s is the size of the alphabet and m and n are, respectively, the lengths of the two strings. Previously known algorithms require O(mn) space. The space- and cache-efficient algorithms of this paper are demonstrated, experimentally, to be superior to earlier algorithms for the DL distance problem on time, space, and enery metrics using three different computational platforms.
Our benchmarking shows that, our algorithms are able to handle much larger sequences than earlier algorithms due to the reduction in space requirements. On a single core, we are able to compute the DL distance and an optimal edit sequence faster than known algorithms by as much as 73.1% and 63.5%, respectively. Further, we reduce energy consumption by as much as 68.5%. Multicore versions of our algorithms achieve a speedup of 23.2 on 24 cores.
在字符串纠错问题中,我们需要使用一组规定的编辑操作将一个字符串转换为另一个字符串。在使用 Damerau-Levenshtein(DL)距离的字符串纠错中,允许的编辑操作有:替换、插入、删除和转置。已经提出了几种使用 DL 距离的字符串纠错算法。其中最快和最节省空间的算法是 Lowrance 和 Wagner 提出的算法。该算法分别在 O(mn)时间和 O(mn)空间内计算两个长度分别为 m 和 n 的字符串之间的 DL 距离。在本文中,我们专注于开发其渐近空间复杂度更小、实际运行时和能耗都小于 Lowrance 和 Wagner 算法的算法。
我们开发了空间和缓存高效的算法来计算两个字符串之间的 Damerau-Levenshtein(DL)距离以及找到长度等于 DL 距离的编辑操作序列。我们的算法需要 O(s min{m,n}+m+n)空间,其中 s 是字母表的大小,m 和 n 分别是两个字符串的长度。之前已知的算法需要 O(mn)空间。本文的空间和缓存高效算法在三个不同的计算平台上,在时间、空间和能量指标方面,都优于之前的 DL 距离问题算法。
我们的基准测试表明,由于空间需求的减少,我们的算法能够处理比之前算法大得多的序列。在单核上,我们能够比已知算法更快地计算 DL 距离和最优编辑序列,分别快 73.1%和 63.5%。此外,我们还降低了 68.5%的能耗。我们的算法的多核版本在 24 核上实现了 23.2 的加速比。