Department of Computer and Information Science and Engineering, University of Florida, Gainesville, 32611, FL, USA.
BMC Bioinformatics. 2020 Dec 9;21(Suppl 1):4. doi: 10.1186/s12859-019-3184-8.
The Damerau-Levenshtein (DL) distance metric has been widely used in the biological science. It tries to identify the similar region of DNA,RNA and protein sequences by transforming one sequence to the another using the substitution, insertion, deletion and transposition operations. Lowrance and Wagner have developed an O(mn) time O(mn) space algorithm to find the minimum cost edit sequence between strings of length m and n, respectively. In our previous research, we have developed algorithms that run in O(mn) time using only O(s∗min{m,n}+m+n) space, where s is the size of the alphabet comprising the strings, to compute the DL distance as well as the corresponding edit sequence. These are so far the fastest and most space efficient algorithms. In this paper, we focus on the development of algorithms whose asymptotic space complexity is linear.
We develop linear space algorithms to compute the Damerau-Levenshtein (DL) distance between two strings and determine the optimal trace (corresponding edit operations.)Extensive experiments conducted on three computational platforms-Xeon E5 2603, I7-x980 and Xeon E5 2695-show that, our algorithms, in addition to using less space, are much faster than earlier algorithms.
Besides using less space than the previously known algorithms,significant run-time improvement was seen for our new algorithms on all three of our experimental platforms. On all platforms, our linear-space cache-efficient algorithms reduced run time by as much as 56.4% and 57.4% in respect to compute the DL distance and an optimal edit sequences compared to previous algorithms. Our multi-core algorithms reduced the run time by up to 59.3% compared to the best previously known multi-core algorithms.
Damerau-Levenshtein(DL)距离度量已广泛应用于生物科学领域。它试图通过使用替换、插入、删除和转位操作将一个序列转换为另一个序列,从而识别 DNA、RNA 和蛋白质序列的相似区域。Lowrance 和 Wagner 开发了一种 O(mn)时间 O(mn)空间算法,分别用于找到长度为 m 和 n 的两个字符串之间的最小代价编辑序列。在我们之前的研究中,我们开发了一种在 O(mn)时间内运行的算法,仅使用 O(s∗min{m,n}+m+n)空间,其中 s 是字符串所包含的字母表的大小,来计算 DL 距离以及相应的编辑序列。到目前为止,这些算法是最快和最节省空间的算法。在本文中,我们专注于开发其渐近空间复杂度为线性的算法。
我们开发了线性空间算法来计算两个字符串之间的 Damerau-Levenshtein(DL)距离,并确定最优跟踪(对应编辑操作)。在三个计算平台上——Xeon E5 2603、I7-x980 和 Xeon E5 2695——进行了广泛的实验,结果表明,我们的算法除了使用更少的空间外,在速度上也比早期算法快得多。
除了使用的空间比以前已知的算法少之外,我们的新算法在我们的所有三个实验平台上都显著提高了运行时间。在所有平台上,与之前的算法相比,我们的线性空间缓存高效算法在计算 DL 距离和最优编辑序列时将运行时间分别缩短了 56.4%和 57.4%。与之前已知的最佳多核算法相比,我们的多核算法将运行时间缩短了高达 59.3%。