Raghava G P S, Barton Geoffrey J
School of Life Sciences Research, University of Dundee, Dow Street, Dundee, DD1 5EH, Scotland, UK.
BMC Bioinformatics. 2006 Sep 19;7:415. doi: 10.1186/1471-2105-7-415.
Percentage Identity (PID) is frequently quoted in discussions of sequence alignments because it appears simple and easy to understand. However, there are several different ways to calculate percentage identity, each of which may yield a different result for the same alignment, and the method of calculation is rarely reported. Quantifying the variation in PID caused by the different calculations would therefore help in interpreting PID values in the literature. In this study, the variation in PID was quantified systematically on a reference set of 1028 alignments generated by comparing protein three-dimensional structures. Since the alignment algorithm may also affect the range of PID, this study also considered the effect of the algorithm, and of the combination of algorithm and PID method.
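To make the point about calculation methods concrete, the following minimal Python sketch computes PID for one pairwise alignment using four different denominators. The choice of denominators (aligned pairs plus internal gap columns, aligned pairs only, length of the shorter sequence, mean sequence length) is an assumption for illustration, since the abstract does not list the four calculations compared; the function name `pid_variants` and the example sequences are likewise hypothetical.

```python
def pid_variants(a: str, b: str, gap: str = "-") -> dict[str, float]:
    """Return PID (in percent) for one pairwise alignment under four denominators.

    `a` and `b` are the two rows of the alignment, gaps included, so they must
    have equal length. The four denominators are illustrative assumptions.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")

    # Ignore columns where both rows hold a gap.
    cols = [(x, y) for x, y in zip(a, b) if not (x == gap and y == gap)]

    # Columns in which both sequences have a residue.
    paired = [i for i, (x, y) in enumerate(cols) if x != gap and y != gap]
    if not paired:
        raise ValueError("alignment contains no aligned residue pairs")

    identical = sum(1 for i in paired if cols[i][0] == cols[i][1])
    len_a = sum(1 for x, _ in cols if x != gap)
    len_b = sum(1 for _, y in cols if y != gap)

    # Span from the first to the last aligned pair: aligned pairs plus
    # internal gap columns, excluding terminal overhangs.
    span = paired[-1] - paired[0] + 1

    return {
        "aligned_plus_internal_gaps": 100.0 * identical / span,
        "aligned_pairs_only": 100.0 * identical / len(paired),
        "shorter_sequence_length": 100.0 * identical / min(len_a, len_b),
        "mean_sequence_length": 100.0 * identical / ((len_a + len_b) / 2),
    }


# Hypothetical toy alignment: 7 identities, 8 aligned pairs, one internal gap.
result = pid_variants("ACDEFGHIKKLM-", "--DEYGH--KLMN")
for name, value in result.items():
    print(f"{name}: {value:.1f}%")
```

On this toy alignment the four denominators are 10, 8, 9 and 10.5, so the same seven identities give PID values ranging from about 67% to 87.5%. Reporting the denominator alongside the PID removes that ambiguity.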
The maximum variation in PID due to the calculation method was 11.5%, while the effect of the alignment algorithm on PID was up to 14.6% across three popular alignment methods. The combined effect of alignment algorithm and PID calculation gave a variation of up to 22% on the test data, with an average of 5.3% ± 2.8% for sequence pairs with < 30% identity. To see which PID method was most highly correlated with structural similarity, four different PID calculations were compared to similarity scores (Sc) from the comparison of the corresponding protein three-dimensional structures. The highest correlation coefficient for a PID calculation was 0.80. In contrast, the more sophisticated Z-score, calculated by reference to randomized sequences, gave a correlation coefficient of 0.84.
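The idea of a Z-score calculated by reference to randomized sequences can be sketched as follows: score the real pair, score the first sequence against many shuffled copies of the second, and express the real score in standard deviations above the shuffled mean. This is a minimal sketch under stated assumptions, not the procedure used in the study: the scoring scheme (match +1, mismatch -1, linear gap penalty -2), the choice to shuffle only one sequence, the number of shuffles, and the names `global_score` and `z_score` are all illustrative.

```python
import random
import statistics


def global_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                 gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions; real analyses normally
    use a substitution matrix and affine gap penalties.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def z_score(a: str, b: str, n_shuffles: int = 100, seed: int = 0) -> float:
    """Z-score of the real alignment score against shuffled copies of `b`."""
    rng = random.Random(seed)
    observed = global_score(a, b)

    residues = list(b)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # keeps length and composition, destroys order
        background.append(global_score(a, "".join(residues)))

    sd = statistics.stdev(background)
    if sd == 0:
        raise ValueError("shuffled scores have zero variance")
    return (observed - statistics.mean(background)) / sd


# Hypothetical sequences: a query, a close variant, and an unrelated sequence.
query = "MKTAYIAKQRQISFVKSHFSRQ"
variant = "MKTAYLAKQRQLSFVKSHYSRQ"
unrelated = "GGPLWNDCEEVTPHGGLWNDCE"
print(z_score(query, variant), z_score(query, unrelated))
```

Because shuffling preserves length and composition, the Z-score corrects for similarity that arises from those properties alone, which a raw identity percentage cannot do.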
Although it is well known amongst expert sequence analysts that PID is a poor score for discriminating between protein sequences, the apparent simplicity of the percentage identity score encourages its widespread use in establishing cutoffs for structural similarity. This paper illustrates that not only is PID a poor measure of sequence similarity when compared to the Z-score, but that there is also a large uncertainty in reported PID values. Since better alternatives to PID exist to quantify sequence similarity, these should be quoted where possible in preference to PID. The findings presented here should prove helpful to those new to sequence analysis, and serve as a warning to those who seek to interpret the value of a PID reported in the literature.