Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, 100 Washtenaw Avenue, Ann Arbor, MI 48109, USA.
Bioinformatics. 2010 Apr 1;26(7):889-95. doi: 10.1093/bioinformatics/btq066. Epub 2010 Feb 17.
Protein structure similarity is often measured by root mean squared deviation (RMSD), global distance test score (GDT-score) and template modeling score (TM-score). However, the scores themselves do not indicate how significant the structural similarity is. There is also no quantitative relation between the scores and conventional fold classifications. This article aims to answer two questions: (i) What is the statistical significance of a TM-score? (ii) What is the probability that two proteins have the same fold given a specific TM-score?
We first performed an all-to-all gapless structural match on 6684 non-homologous single-domain proteins in the PDB and found that the TM-scores follow an extreme value distribution. The data allow us to assign each TM-score a P-value that measures the chance of two randomly selected proteins obtaining an equal or higher TM-score. A TM-score of 0.5, for instance, has a P-value of 5.5 × 10^-7, which means we need to consider at least 1.8 million random protein pairs to obtain a TM-score of no less than 0.5. Second, we examined the posterior probability that two proteins share the same fold, using three datasets: SCOP, CATH and the consensus of SCOP and CATH. The posterior probabilities from the different datasets show a similar rapid phase transition around TM-score = 0.5. This finding indicates that TM-score can be used as an approximate but quantitative criterion for protein topology classification, i.e. protein pairs with a TM-score >0.5 are mostly in the same fold, while those with a TM-score <0.5 are mainly not in the same fold.
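The arithmetic behind these statements can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the Gumbel (type I) form is only one possible extreme value distribution with `loc` and `scale` as placeholder parameters that the abstract does not report, and the posterior is written as a plain two-class Bayes rule under the assumption that the fold likelihoods and the prior are supplied by the caller.

```python
import math


def expected_pairs_for_one_hit(p_value: float) -> float:
    """Number of random pairs needed, on average, to see one pair at or
    above the score whose P-value is p_value (simply 1 / P)."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be in (0, 1]")
    return 1.0 / p_value


def gumbel_p_value(score: float, loc: float, scale: float) -> float:
    """P(X >= score) under a Gumbel distribution.

    ASSUMPTION: Gumbel form and the loc/scale values are illustrative;
    the abstract only says 'extreme value distribution'.
    """
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    z = (score - loc) / scale
    # 1 - exp(-exp(-z)), computed with expm1 for accuracy in the tail
    return -math.expm1(-math.exp(-z))


def posterior_same_fold(lik_same: float, lik_diff: float, prior_same: float) -> float:
    """Bayes rule: P(same fold | TM) from P(TM | same), P(TM | different)
    and the prior P(same). All three inputs are hypothetical here."""
    numerator = lik_same * prior_same
    denominator = numerator + lik_diff * (1.0 - prior_same)
    if denominator == 0.0:
        raise ValueError("likelihoods and prior give zero total probability")
    return numerator / denominator


if __name__ == "__main__":
    # P-value quoted in the abstract for TM-score = 0.5
    print(expected_pairs_for_one_hit(5.5e-7))  # about 1.8 million
```

With the P-value quoted above for TM-score = 0.5, the first function reproduces the figure of roughly 1.8 million random pairs. The other two functions only show the shape of the calculation; real use would require distribution parameters and fold statistics fitted to the actual datasets.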