Fast exact algorithms for the closest string and substring problems with application to the planted (L, d)-motif model.

Affiliation

Department of Mathematical Sciences, Tokyo Denki University, Hatoyama, Saitama 350-0394, Japan.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1400-10. doi: 10.1109/TCBB.2011.21.

Abstract

We present two parameterized algorithms for the closest string problem. The first runs in $O(nL + nd \cdot 17.97^d)$ time for DNA strings and in $O(nL + nd \cdot 61.86^d)$ time for protein strings, where $n$ is the number of input strings, $L$ is the length of each input string, and $d$ is the given upper bound on the number of mismatches between the center string and each input string. The second runs in $O(nL + nd \cdot 13.92^d)$ time for DNA strings and in $O(nL + nd \cdot 47.21^d)$ time for protein strings. We then extend the first algorithm to a new parameterized algorithm for the closest substring problem that runs in $O((n-1)\,m^2\,(L + d \cdot 17.97^d \cdot m^{\lfloor \log_2(d+1) \rfloor}))$ time for DNA strings and in $O((n-1)\,m^2\,(L + d \cdot 61.86^d \cdot m^{\lfloor \log_2(d+1) \rfloor}))$ time for protein strings, where $n$ is the number of input strings, $L$ is the length of the center substring, $L - 1 + m$ is the maximum length of a single input string, and $d$ is the given upper bound on the number of mismatches between the center substring and at least one substring of each input string. All of these algorithms significantly improve on the previous best time bounds. To verify the theoretical improvements in time complexity experimentally, we implement our algorithm in C and apply the resulting program to the planted (L, d)-motif problem proposed by Pevzner and Sze in 2000. We compare our program with the previously best exact program for the problem, namely PMSPrune (designed by Davila et al. in 2007). Our experimental data show that our program runs faster for practical cases and also for several challenging cases. Our algorithm also uses less memory.
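The two problem statements in the abstract can be made concrete with a small sketch. The Python below is not the authors' algorithm (their program is written in C and is far faster); it only checks the defining conditions and, for tiny inputs, searches all candidate centers exhaustively. All function names here are hypothetical and chosen for illustration.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_center_string(center: str, strings: list[str], d: int) -> bool:
    """Closest string condition: center is within d mismatches of every input."""
    return all(hamming(center, s) <= d for s in strings)


def is_center_substring(center: str, strings: list[str], d: int) -> bool:
    """Closest substring condition: every input has at least one
    length-L window within d mismatches of center, where L = len(center)."""
    L = len(center)
    return all(
        any(hamming(center, s[i:i + L]) <= d for i in range(len(s) - L + 1))
        for s in strings
    )


def brute_force_closest_string(strings: list[str], d: int, alphabet: str = "ACGT"):
    """Naive baseline (assumption: illustration only, not the paper's method).
    Tries all len(alphabet)**L candidates, so it is usable only for tiny L."""
    L = len(strings[0])
    for candidate in product(alphabet, repeat=L):
        center = "".join(candidate)
        if is_center_string(center, strings, d):
            return center
    return None  # no center string exists for this d
```

The exhaustive search costs on the order of $|\Sigma|^L \cdot nL$ steps for alphabet $\Sigma$, which is exponential in the string length $L$. The bounds quoted in the abstract are instead exponential only in the mismatch bound $d$, which is why they matter when $d$ is small relative to $L$, as in the planted (L, d)-motif setting.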
