Atteson K
Proc Int Conf Intell Syst Mol Biol. 1998;6:17-24.
We present algorithms for the exact computation of the probability that a random string of a certain length matches a given regular expression. These algorithms can be used to determine statistical significance in a variety of pattern searches such as motif searches and gene-finding. This work improves upon work of Kleffe and Langebacker (Kleffe & Langbecker 1990) and of Sewell and Durbin (Sewell & Durbin 1995) in several ways. First, in many cases of interest, the algorithms presented here are faster. In addition, the type of pattern considered here strictly includes those of both previous works but also allows, for instance, arbitrary length gaps. Also, the type of probability model which can be used is more general than that of Sewell and Durbin, allowing for Markov chains. The problem solved in this work is in fact in the class of NP-hard problems which are believed to be intractable. However, the problem is fixed-parameter tractable, meaning that it is tractable for small patterns. The is problem is also computationally feasible for many patterns which occur in practice. As a sample application, we consider calculating the statistical significance of most of the PROSITE patterns as in Sewell and Durbin. Whereas their method was only fast enough to exactly compute the probabilities for sequences of length 13 larger than the pattern length, we calculate these probabilities for sequences of up to length 2000. In addition, we calculate most of these probabilities using a first order Markov chain. Most of the PROSITE patterns have high significance at length 2000 under both the i.i.d. and Markov chain models. For further applications, we demonstrate the calculation of the probability of a PROSITE pattern occurring on either strand of a random DNA sequence of up to 500 kilo-bases and the probability of a simple gene model occurring in a random sequence of up to 1 megabase.
我们提出了一些算法,用于精确计算特定长度的随机字符串与给定正则表达式匹配的概率。这些算法可用于确定各种模式搜索(如基序搜索和基因查找)中的统计显著性。这项工作在几个方面改进了克莱费和兰格贝克(Kleffe & Langbecker,1990)以及休厄尔和德宾(Sewell & Durbin,1995)的工作。首先,在许多感兴趣的情况下,这里提出的算法速度更快。此外,这里考虑的模式类型不仅严格包含了之前两篇论文中的模式类型,还允许例如任意长度的间隔。而且,可以使用的概率模型类型比休厄尔和德宾的更通用,允许使用马尔可夫链。这项工作中解决的问题实际上属于NP难问题类别,一般认为这类问题难以处理。然而,该问题是固定参数可处理的,这意味着对于小模式来说它是可处理的。对于许多实际出现的模式,这个问题在计算上也是可行的。作为一个示例应用,我们像休厄尔和德宾那样考虑计算大多数PROSITE模式的统计显著性。他们的方法只够快速精确计算长度大于模式长度13的序列的概率,而我们计算了长度高达2000的序列的这些概率。此外,我们使用一阶马尔可夫链计算了这些概率中的大部分。在独立同分布和马尔可夫链模型下,大多数PROSITE模式在长度为2000时都具有高度显著性。对于进一步的应用,我们展示了计算长度高达500千碱基的随机DNA序列的任意一条链上出现PROSITE模式的概率,以及长度高达1兆碱基的随机序列中出现简单基因模型的概率。