重新审视正则语言约束序列比对

Regular language constrained sequence alignment revisited.

作者信息

Kucherov Gregory, Pinhas Tamar, Ziv-Ukelson Michal

机构信息

LIFL/CNRS and INRIA Lille Nord-Europe, Villeneuve d'Ascq, France.

出版信息

J Comput Biol. 2011 May;18(5):771-81. doi: 10.1089/cmb.2010.0291.

DOI:10.1089/cmb.2010.0291

PMID:21554020

Abstract

Imposing constraints in the form of a finite automaton or a regular expression is an effective way to incorporate additional a priori knowledge into sequence alignment procedures. With this motivation, the Regular Expression Constrained Sequence Alignment Problem was introduced, which proposed an O(n²t⁴) time and O(n²t²) space algorithm for solving it, where n is the length of the input strings and t is the number of states in the input non-deterministic automaton. A faster O(n²t³) time algorithm for the same problem was subsequently proposed. In this article, we further speed up the algorithms for Regular Language Constrained Sequence Alignment by reducing their worst case time complexity bound to O(n²t³)/log t). This is done by establishing an optimal bound on the size of Straight-Line Programs solving the maxima computation subproblem of the basic dynamic programming algorithm. We also study another solution based on a Steiner Tree computation. While it does not improve the worst case, our simulations show that both approaches are efficient in practice, especially when the input automata are dense.

摘要

以有限自动机或正则表达式的形式施加约束是将额外的先验知识纳入序列比对过程的有效方法。出于这一动机，引入了正则表达式约束序列比对问题，并提出了一种用于解决该问题的时间复杂度为O(n²t⁴)且空间复杂度为O(n²t²)的算法，其中n是输入字符串的长度，t是输入非确定性自动机中的状态数。随后针对同一问题提出了一种更快的时间复杂度为O(n²t³)的算法。在本文中，我们通过将正则语言约束序列比对算法的最坏情况时间复杂度界限降低到O(n²t³)/log t)，进一步加快了这些算法的速度。这是通过为解决基本动态规划算法的最大值计算子问题的直线程序的大小建立一个最优界限来实现的。我们还研究了另一种基于斯坦纳树计算的解决方案。虽然它没有改善最坏情况，但我们的模拟表明，这两种方法在实践中都是有效的，特别是当输入自动机密集时。