Fast exact algorithms for the closest string and substring problems with application to the planted (L, d)-motif model.

Affiliation

Department of Mathematical Sciences, Tokyo Denki University, Hatoyama, Saitama 350-0394, Japan.

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1400-10. doi: 10.1109/TCBB.2011.21.

Abstract

We present two parameterized algorithms for the closest string problem. The first runs in O(nL + nd · 17.97^d) time for DNA strings and in O(nL + nd · 61.86^d) time for protein strings, where n is the number of input strings, L is the length of each input string, and d is the given upper bound on the number of mismatches between the center string and each input string. The second runs in O(nL + nd · 13.92^d) time for DNA strings and in O(nL + nd · 47.21^d) time for protein strings. We then extend the first algorithm to a new parameterized algorithm for the closest substring problem that runs in O((n - 1)m^2(L + d · 17.97^d · m^[log_2(d+1)])) time for DNA strings and in O((n - 1)m^2(L + d · 61.86^d · m^[log_2(d+1)])) time for protein strings, where n is the number of input strings, L is the length of the center substring, L - 1 + m is the maximum length of a single input string, and d is the given upper bound on the number of mismatches between the center substring and at least one substring of each input string. All the algorithms improve significantly on the previous best bounds. To verify the theoretical improvements in time complexity experimentally, we implement our algorithm in C and apply the resulting program to the planted (L, d)-motif problem proposed by Pevzner and Sze in 2000. We compare our program with the previously best exact program for the problem, namely PMSPrune (designed by Davila et al. in 2007). Our experimental data show that our program runs faster for practical cases and also for several challenging cases. Our algorithm also uses less memory.
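To make the problem statement concrete, the following Python sketch solves small closest string instances with the classic bounded search tree technique. It is not either of the algorithms presented in this paper (whose details the abstract does not give), and the names `hamming`, `closest_string`, and `search` are illustrative assumptions. The sketch starts from the first input string and, whenever some input string is more than d mismatches away, branches over d + 1 of the differing positions, so the search tree has depth at most d and at most (d + 1)^d leaves.

```python
from typing import List, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_string(strings: List[str], d: int) -> Optional[str]:
    """Return a center string within Hamming distance d of every input
    string, or None if no such string exists.

    Illustrative bounded search tree, not the paper's algorithm.
    Assumes all input strings have the same length.
    """
    if not strings:
        return None

    def search(candidate: str, budget: int) -> Optional[str]:
        # Find the input string that is farthest from the candidate.
        worst = max(strings, key=lambda s: hamming(candidate, s))
        dist = hamming(candidate, worst)
        if dist <= d:
            return candidate  # candidate is a valid center string
        if dist > d + budget:
            return None  # cannot be repaired with the remaining changes
        # Any valid center agrees with `worst` on at least one of any
        # d + 1 positions where the candidate and `worst` differ.
        diff = [i for i in range(len(candidate)) if candidate[i] != worst[i]]
        for i in diff[: d + 1]:
            changed = candidate[:i] + worst[i] + candidate[i + 1:]
            found = search(changed, budget - 1)
            if found is not None:
                return found
        return None

    # The candidate may move at most d positions away from strings[0].
    return search(strings[0], d)


if __name__ == "__main__":
    dna = ["AAAA", "TTTT", "AATT"]
    print(closest_string(dna, 2))  # a center within 2 mismatches of each
    print(closest_string(["AAAA", "TTTT"], 1))  # None: no center exists
```

The closest substring problem and the planted (L, d)-motif problem add a further choice on top of this: one length-L window must be selected from each input string before the same distance check applies.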

Similar articles

Fast exact algorithms for the closest string and substring problems with application to the planted (L, d)-motif model.

IEEE/ACM Trans Comput Biol Bioinform. 2011 Sep-Oct;8(5):1400-10. doi: 10.1109/TCBB.2011.21.

Improved Exact Enumerative Algorithms for the Planted (l, d)-Motif Search Problem.

IEEE/ACM Trans Comput Biol Bioinform. 2014 Mar-Apr;11(2):361-74. doi: 10.1109/TCBB.2014.2306842.

Efficient sequential and parallel algorithms for finding edit distance based motifs.

BMC Genomics. 2016 Aug 18;17 Suppl 4(Suppl 4):465. doi: 10.1186/s12864-016-2789-9.

On the hardness of counting and sampling center strings.

IEEE/ACM Trans Comput Biol Bioinform. 2012 Nov-Dec;9(6):1843-6. doi: 10.1109/TCBB.2012.84.

Fast and practical algorithms for planted (l, d) motif search.

IEEE/ACM Trans Comput Biol Bioinform. 2007 Oct-Dec;4(4):544-52. doi: 10.1109/TCBB.2007.70241.

PMS6: a fast algorithm for motif discovery.

Int J Bioinform Res Appl. 2014;10(4-5):369-83. doi: 10.1504/IJBRA.2014.062990.

An Efficient Exact Algorithm for the Motif Stem Search Problem over Large Alphabets.

IEEE/ACM Trans Comput Biol Bioinform. 2015 Mar-Apr;12(2):384-97. doi: 10.1109/TCBB.2014.2361668.

Experimental study of modified voting algorithm for planted (l,d)-motif problem.

Adv Exp Med Biol. 2010;680:65-73. doi: 10.1007/978-1-4419-5913-3_8.

Short superstrings and the structure of overlapping strings.

J Comput Biol. 1995 Summer;2(2):307-32. doi: 10.1089/cmb.1995.2.307.

Closest string with outliers.

BMC Bioinformatics. 2011 Feb 15;12 Suppl 1(Suppl 1):S55. doi: 10.1186/1471-2105-12-S1-S55.

Cited by

A Review on Planted (l, d) Motif Discovery Algorithms for Medical Diagnose.

Sensors (Basel). 2022 Feb 5;22(3):1204. doi: 10.3390/s22031204.

Review of Different Sequence Motif Finding Algorithms.

Avicenna J Med Biotechnol. 2019 Apr-Jun;11(2):130-148.
