Department of Mathematical Sciences, Tokyo Denki University, Hatoyama, Saitama 350-0394, Japan.
IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1400-10. doi: 10.1109/TCBB.2011.21.
We present two parameterized algorithms for the closest string problem. The first runs in $O(nL + nd \cdot 17.97^d)$ time for DNA strings and in $O(nL + nd \cdot 61.86^d)$ time for protein strings, where $n$ is the number of input strings, $L$ is the length of each input string, and $d$ is the given upper bound on the number of mismatches between the center string and each input string. The second runs in $O(nL + nd \cdot 13.92^d)$ time for DNA strings and in $O(nL + nd \cdot 47.21^d)$ time for protein strings. We then extend the first algorithm to a new parameterized algorithm for the closest substring problem that runs in $O((n-1)m^2(L + d \cdot 17.97^d \cdot m^{[\log_2(d+1)]}))$ time for DNA strings and in $O((n-1)m^2(L + d \cdot 61.86^d \cdot m^{[\log_2(d+1)]}))$ time for protein strings, where $n$ is the number of input strings, $L$ is the length of the center substring, $L - 1 + m$ is the maximum length of a single input string, and $d$ is the given upper bound on the number of mismatches between the center substring and at least one substring of each input string. All of these algorithms significantly improve on the previous best bounds. To verify the theoretical improvements in time complexity experimentally, we implement our algorithm in C and apply the resulting program to the planted $(L, d)$-motif problem proposed by Pevzner and Sze in 2000. We compare our program with the previously best exact program for the problem, namely PMSPrune (designed by Davila et al. in 2007). Our experimental data show that our program runs faster for practical cases and also for several challenging cases. Our algorithm also uses less memory.
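To make the closest string problem concrete, the following is a minimal Python sketch of the decision problem: given equal-length strings and a bound d, find a center string within d mismatches of every input, or report that none exists. It uses the simple textbook bounded search tree (start from the first input string and repair at most d positions), not either of the algorithms presented here and not the authors' C implementation. The names `hamming` and `closest_string` are hypothetical and chosen for illustration only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_string(strings, d):
    """Return a center within d mismatches of every input string, or None.

    Illustrative bounded search tree only (an assumption of this sketch):
    it branches on at most d + 1 positions and recurses to depth at most d.
    """
    if not strings:
        raise ValueError("need at least one input string")
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all input strings must have the same length")

    def search(candidate, budget):
        # Look for an input string that the candidate is still too far from.
        for s in strings:
            dist = hamming(candidate, s)
            if dist <= d:
                continue
            # Even spending the whole remaining budget cannot bring the
            # candidate within d mismatches of s, so prune this branch.
            if dist > d + budget:
                return None
            # Any valid center agrees with s on at least one of any d + 1
            # positions where the candidate and s differ; try each of them.
            mismatches = [i for i in range(length) if candidate[i] != s[i]]
            for i in mismatches[: d + 1]:
                repaired = candidate[:i] + s[i] + candidate[i + 1:]
                found = search(repaired, budget - 1)
                if found is not None:
                    return found
            return None
        # Every input string is within d mismatches of the candidate.
        return candidate

    return search(strings[0], d)
```

For example, `closest_string(["AAAA", "TTTT"], 2)` returns some string that differs from each input in at most two positions, while the same call with `d = 1` returns `None`. The search depth is bounded by d rather than by L, which is what makes d a natural parameter for this problem.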