Jain Pooja, Hirst Jonathan D
School of Chemistry, The University of Nottingham, University Park, Nottingham NG7 2RD, UK.
BMC Struct Biol. 2009 Sep 19;9:60. doi: 10.1186/1472-6807-9-60.
Classification of newly resolved protein structures is important for understanding their architectural, evolutionary and functional relatedness to known protein structures. Among various efforts to improve the Structural Classification of Proteins (SCOP) database, automation has received particular attention. Herein, we predict the deepest SCOP structural level that an unclassified protein shares with classified proteins that have an equal number of secondary structure elements (SSEs).
We compute a coefficient of dissimilarity (Omega) between proteins, based on structural and sequence-based descriptors characterising the respective constituent SSEs. For a set of 1,661 pairs of proteins with sequence identity up to 35%, the performance of Omega in predicting shared Class, Fold and Super-family levels is comparable to that of the DaliLite Z score, and Omega shows a greater than four-fold increase in the true positive rate (TPR) for proteins sharing the Family level. On a larger set of 600 domains representing 200 families, the performance of the Z score in predicting a shared Family improves, but it still achieves only about half of the TPR of Omega. The TPR for structures sharing a Super-family is lower than in the first dataset, but Omega performs slightly better than the Z score. Overall, the sensitivity of Omega in predicting a common Fold level is higher than that of the DaliLite Z score.
Classification at a deeper level of the hierarchy is more specific and more difficult, so the efficiency of Omega may be attractive to the curators and end-users of SCOP. We suggest that Omega may be a better measure for structure classification than the DaliLite Z score, with the caveat that we are currently restricted to comparing structures with an equal number of SSEs.
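The comparison above is reported in terms of the true positive rate (TPR), the fraction of protein pairs that genuinely share a SCOP level and are also predicted to share it. The following is a minimal Python sketch of that calculation for a threshold-based predictor, not the authors' implementation. The function name, the thresholds and the toy scores are all hypothetical and are not taken from the study. It assumes that a dissimilarity such as Omega predicts a shared level when the score is at or below a cutoff, and that a similarity such as a Z score predicts it when the score is at or above a cutoff.

```python
from typing import Sequence


def true_positive_rate(
    scores: Sequence[float],
    shares_level: Sequence[bool],
    threshold: float,
    lower_is_similar: bool = True,
) -> float:
    """Return TP / (TP + FN) for a threshold-based prediction.

    scores           -- one score per protein pair (hypothetical values)
    shares_level     -- True if the pair really shares the SCOP level
    threshold        -- assumed cutoff; not a value from the abstract
    lower_is_similar -- True for a dissimilarity (e.g. Omega),
                        False for a similarity (e.g. a Z score)
    """
    true_pos = 0
    false_neg = 0
    for score, positive in zip(scores, shares_level):
        if not positive:
            continue  # negatives do not enter the TPR
        if lower_is_similar:
            predicted = score <= threshold
        else:
            predicted = score >= threshold
        if predicted:
            true_pos += 1
        else:
            false_neg += 1
    if true_pos + false_neg == 0:
        raise ValueError("TPR is undefined without positive pairs")
    return true_pos / (true_pos + false_neg)


# Invented toy values, chosen only to exercise the function.
toy_dissimilarity = [0.10, 0.40, 0.25, 0.90, 0.15]
toy_similarity = [12.0, 3.0, 8.5, 1.0, 2.5]
toy_same_family = [True, True, True, False, True]

tpr_dissimilarity = true_positive_rate(
    toy_dissimilarity, toy_same_family, threshold=0.30
)
tpr_similarity = true_positive_rate(
    toy_similarity, toy_same_family, threshold=5.0, lower_is_similar=False
)
print(tpr_dissimilarity, tpr_similarity)
```

The `lower_is_similar` flag is there so that one function can score both kinds of measure. In practice the cutoffs would have to be chosen on held-out data, and the TPR would be reported alongside a false positive rate.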