Levitt M, Gerstein M
Department of Structural Biology, Stanford University, Stanford, CA 94305, USA.
Proc Natl Acad Sci U S A. 1998 May 26;95(11):5913-20. doi: 10.1073/pnas.95.11.5913.
We present an approach for assessing the significance of sequence and structure comparisons by using nearly identical statistical formalisms for both sequence and structure. Doing so involves an all-vs.-all comparison of protein domains [taken here from the Structural Classification of Proteins (scop) database] and then fitting a simple distribution function to the observed scores. By using this distribution, we can attach a statistical significance to each comparison score in the form of a P value, the probability that a better score would occur by chance. As expected, we find that the scores for sequence matching follow an extreme-value distribution. The agreement, moreover, between the P values that we derive from this distribution and those reported by standard programs (e.g., BLAST and FASTA) validates our approach. Structure comparison scores also follow an extreme-value distribution when the statistics are expressed in terms of a structural alignment score (essentially the sum of reciprocated distances between aligned atoms minus gap penalties). We find that the traditional metric of structural similarity, the rms deviation in atom positions after fitting aligned atoms, follows a different distribution of scores and does not perform as well as the structural alignment score. Comparison of the sequence and structure statistics for pairs of proteins known to be related distantly shows that structural comparison is able to detect approximately twice as many distant relationships as sequence comparison at the same error rate. The comparison also indicates that there are very few pairs with significant similarity in terms of sequence but not structure whereas many pairs have significant similarity in terms of structure but not sequence.
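The step of fitting a distribution to observed scores and reading off a P value can be illustrated with a minimal sketch. It assumes a type I extreme-value (Gumbel) form and a method-of-moments fit; the abstract does not state which fitting procedure or parameterization the authors used, so the function names (`fit_gumbel_moments`, `p_value`, `sample_gumbel`) and the synthetic scores below are illustrative assumptions, not the paper's method or data.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Estimate Gumbel location and scale by the method of moments.

    Assumption: a simple moments fit stands in for whatever fitting
    procedure the paper actually used.
    """
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    scale = std * math.sqrt(6) / math.pi   # Gumbel variance = (pi * scale)^2 / 6
    loc = mean - EULER_GAMMA * scale       # Gumbel mean = loc + gamma * scale
    return loc, scale


def p_value(score, loc, scale):
    """Probability that a chance score exceeds `score` under the fitted Gumbel."""
    z = (score - loc) / scale
    if z < -700:                           # exp(-z) would overflow; tail is ~1
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 to stay accurate for tiny P values
    return -math.expm1(-math.exp(-z))


def sample_gumbel(rng, loc, scale):
    """Draw one Gumbel variate by inverting the CDF (used only for the demo)."""
    u = max(rng.random(), 1e-12)           # guard against log(0)
    return loc - scale * math.log(-math.log(u))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for all-vs.-all comparison scores (not real data).
    scores = [sample_gumbel(rng, 10.0, 2.0) for _ in range(20000)]
    loc, scale = fit_gumbel_moments(scores)
    print(f"fitted loc={loc:.2f}, scale={scale:.2f}")
    print(f"P(score > 25) = {p_value(25.0, loc, scale):.3g}")
```

Under this assumed form, a score far above the fitted location gets a small P value, which is the sense in which a comparison score is called significant. The same two functions would apply unchanged to sequence scores or structural alignment scores, mirroring the shared formalism described above.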