Bailey T L, Gribskov M
San Diego Supercomputer Center, California 92186-9784, USA.
J Comput Biol. 1997 Spring;4(1):45-59. doi: 10.1089/cmb.1997.4.45.
Several computer algorithms now exist for discovering multiple motifs (expressed as weight matrices) that characterize a family of protein sequences known to be homologous. This paper describes a method for performing similarity searches of protein sequence databases using such a group of motifs. By simultaneously using all the motifs that characterize a protein family, the sensitivity and specificity of the database search are increased. We define the p-value for a target sequence to be the probability of a random sequence of the same length scoring as well or better in comparison to all the motifs that characterize the family. (The p-value of a database search can be determined from this value and the size of the database.) We show that estimating the distribution of single motif scores by a Gaussian extreme value distribution is insufficiently accurate to provide a useful estimate of the p-value, but that this deficiency can be corrected by reestimating the parameters of the underlying Gaussian distribution from observed scores for comparison of a given motif and sequence database. These parameters are used to calculate a "reduced variate" which has a Gumbel limiting distribution. Multiple motif scores are combined to give a single p-value by using the sum of the reduced variates for the motif scores as the test statistic. We give a computationally efficient approximation to the distribution of the sum of independent Gumbel random variables and verify experimentally that it closely approximates the distribution of the test statistic. Experiments on pseudorandom sequences show that the approximated p-values are conservative, so the significance of high scores in database searches will not be overstated. Experiments with real protein sequences and motifs identified by the MEME algorithm show that determining an overall p-value based on the combination of multiple motifs gives significantly better database search results than using p-values of single motifs.
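The scoring pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the abstract does not give their formulas, so every name below is hypothetical. The sketch rests on three assumptions: the reduced variate uses the textbook normalizing constants for the maximum of n independent Gaussian scores; the distribution of the summed reduced variates is estimated by plain Monte Carlo sampling, standing in for the paper's computationally efficient approximation; and the database-level p-value treats the sequences in the database as independent.

```python
import math
import random


def reduced_variate(score, mu, sigma, n_positions):
    """Map the best of n_positions Gaussian(mu, sigma) scores to a reduced variate.

    Assumption: textbook normalizing constants for the maximum of n
    independent Gaussian variables; mu and sigma would be re-estimated
    from observed scores of the motif against the database.
    """
    if n_positions < 2:
        raise ValueError("n_positions must be at least 2")
    a = math.sqrt(2.0 * math.log(n_positions))
    b = a - (math.log(math.log(n_positions)) + math.log(4.0 * math.pi)) / (2.0 * a)
    return a * ((score - mu) / sigma - b)


def sample_gumbel(rng):
    """Draw one standard Gumbel variate by inverting its CDF exp(-exp(-y))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() can return exactly 0.0
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_sum_pvalue_mc(total, n_motifs, trials=100_000, seed=0):
    """Estimate P(sum of n_motifs independent standard Gumbels >= total).

    Assumption: Monte Carlo stand-in for the paper's approximation.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(sample_gumbel(rng) for _ in range(n_motifs))
        if s >= total:
            hits += 1
    return hits / trials


def database_pvalue(p_sequence, n_sequences):
    """Probability that at least one of n_sequences independent random
    sequences reaches the given per-sequence p-value (assumed form)."""
    if p_sequence >= 1.0:
        return 1.0
    # 1 - (1 - p)^N, computed stably for very small p.
    return -math.expm1(n_sequences * math.log1p(-p_sequence))


if __name__ == "__main__":
    # Hypothetical (score, mu, sigma) triples for three motifs on one sequence.
    motif_scores = [(14.2, 0.0, 3.1), (11.8, -0.5, 2.9), (9.4, 0.2, 2.5)]
    n_positions = 300  # hypothetical number of motif start positions
    total = sum(reduced_variate(s, m, sd, n_positions) for s, m, sd in motif_scores)
    p_seq = gumbel_sum_pvalue_mc(total, n_motifs=len(motif_scores))
    print("sum of reduced variates:", total)
    print("per-sequence p-value:", p_seq)
    print("database p-value:", database_pvalue(p_seq, n_sequences=50_000))
```

Monte Carlo sampling keeps the sketch short but cannot resolve p-values much smaller than one over the number of trials. That limitation is one reason an analytic approximation to the distribution of the sum, as the paper describes, is preferable in practice.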