Myers E W, Miller W
Bull Math Biol. 1989;51(1):5-37. doi: 10.1007/BF02458834.
Given a sequence A and regular expression R, the approximate regular expression matching problem is to find a sequence matching R whose optimal alignment with A is the highest scoring of all such sequences. This paper develops an algorithm to solve the problem in time O(MN), where M and N are the lengths of A and R. Thus, the time requirement is asymptotically no worse than for the simpler problem of aligning two fixed sequences. Our method is superior to an earlier algorithm by Wagner and Seiferas in several ways. First, it treats real-valued costs, in addition to integer costs, with no loss of asymptotic efficiency. Second, it requires only O(N) space to deliver just the score of the best alignment. Finally, its structure permits implementation techniques that make it extremely fast in practice. We extend the method to accommodate gap penalties, as required for typical applications in molecular biology, and further refine it to search for sub-strings of A that strongly align with a sequence in R, as required for typical data base searches. We also show how to deliver an optimal alignment between A and R in only O(N + log M) space using O(MN log M) time. Finally, an O(MN(M + N) + N2log N) time algorithm is presented for alignment scoring schemes where the cost of a gap is an arbitrary increasing function of its length.
给定一个序列A和正则表达式R,近似正则表达式匹配问题是要找到一个与R匹配的序列,该序列与A的最优比对是所有此类序列中得分最高的。本文开发了一种算法,能在O(MN)时间内解决该问题,其中M和N分别是A和R的长度。因此,时间要求在渐近意义上不比比对两个固定序列的简单问题更差。我们的方法在几个方面优于Wagner和Seiferas早期的算法。首先,除了整数值代价外,它还能处理实数值代价,且不损失渐近效率。其次,它只需要O(N)的空间就能给出最佳比对的得分。最后,其结构允许采用一些实现技术,使其在实际应用中非常快速。我们扩展该方法以适应分子生物学典型应用所需的空位罚分,并进一步改进它以搜索A中与R中的序列强烈比对的子串,这是典型数据库搜索所需要的。我们还展示了如何使用O(MN log M)时间,仅在O(N + log M)空间内给出A和R之间的最优比对。最后,针对空位代价是其长度的任意递增函数的比对计分方案,给出了一个O(MN(M + N) + N2log N)时间的算法。