Sam Vichetra, Tai Chin-Hsien, Garnier Jean, Gibrat Jean-Francois, Lee Byungkook, Munson Peter J
Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, DHHS, Bethesda, MD, USA.
BMC Bioinformatics. 2008 Jan 31;9:74. doi: 10.1186/1471-2105-9-74.
Formal classification of a large collection of protein structures aids the understanding of evolutionary relationships among them. Classifications involving manual steps, such as SCOP and CATH, face the challenge of an increasing volume of available structures. Automatic methods, such as FSSP or the Dali Domain Dictionary, yield divergent classifications for reasons not yet fully investigated. One possible reason is that the pairwise similarity scores used in automatic classification do not adequately reflect the judgments made in manual classification. Another possibility is the difference between manual and automatic classification procedures. We explore the degree to which these two factors might affect the final classification.
We use DALI, SHEBA and VAST pairwise scores on the SCOP C class domains to investigate a variety of hierarchical clustering procedures. The constructed dendrogram is cut in a variety of ways to produce a partition, which is compared to the SCOP fold classification. Ward's method dendrograms led to partitions closest to the SCOP fold classification. Dendrogram- or tree-cutting strategies fell into four categories according to the similarity of the resulting partitions to the SCOP fold partition. Two strategies that optimize similarity to SCOP gave an average true positive rate (TPR) of 72% at a 1% false positive rate. Cutting the largest cluster at each step gave an average TPR of 61%, making it one of the best strategies that do not use prior knowledge of SCOP. Cutting the longest branch at each step was one of the worst strategies. We also developed a method to detect irreducible differences between the best possible automatic partitions and SCOP, regardless of the cutting strategy. These differences are substantial. Visual examination of hard-to-classify proteins confirms our previous finding that global structural similarity of domains is not the only criterion used in the SCOP classification.
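The tree-cutting and comparison steps can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the dendrogram is already built and is represented as nested two-element tuples with domain names as leaves, TPR and FPR are counted over unordered pairs of domains (the abstract does not state the exact definition), and the names `cut_largest`, `pair_rates`, `tree` and `scop` as well as the toy data are hypothetical.

```python
from itertools import combinations


def leaves(node):
    """Return the leaf names under a node (a str leaf or a (left, right) tuple)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def cut_largest(tree, n_clusters):
    """Split the cluster with the most leaves until n_clusters clusters exist.

    Assumed reading of "cutting the largest cluster at each step".
    Returns a dict mapping each leaf to an integer cluster label.
    """
    clusters = [tree]
    while len(clusters) < n_clusters:
        splittable = [c for c in clusters if not isinstance(c, str)]
        if not splittable:
            break  # only single leaves remain; nothing left to cut
        biggest = max(splittable, key=lambda c: len(leaves(c)))
        clusters.remove(biggest)
        clusters.extend(biggest)  # replace the cluster by its two children
    return {leaf: i for i, c in enumerate(clusters) for leaf in leaves(c)}


def pair_rates(predicted, reference):
    """Return (TPR, FPR) counted over all unordered pairs of items.

    A pair is a positive when the reference partition groups both items
    together; it is predicted positive when the automatic partition does.
    """
    tp = fp = pos = neg = 0
    for a, b in combinations(sorted(reference), 2):
        same_ref = reference[a] == reference[b]
        same_pred = predicted[a] == predicted[b]
        if same_ref:
            pos += 1
            tp += same_pred
        else:
            neg += 1
            fp += same_pred
    tpr = tp / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return tpr, fpr


# Toy data (hypothetical): five domains, two reference folds.
tree = ((("d1", "d2"), "d3"), ("d4", "d5"))
scop = {"d1": "foldA", "d2": "foldA", "d3": "foldA", "d4": "foldB", "d5": "foldB"}

two_clusters = pair_rates(cut_largest(tree, 2), scop)    # (1.0, 0.0)
three_clusters = pair_rates(cut_largest(tree, 3), scop)  # (0.5, 0.0)
```

On this toy tree, two clusters reproduce the reference folds exactly, while a third cut splits fold A and halves the pairwise TPR without adding false positives. Sweeping the number of clusters in this way traces out the TPR versus FPR trade-off for one cutting strategy.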
Different clustering procedures give rise to different levels of agreement between automatic and manual protein classifications. None of the tested procedures completely eliminates the divergence between automatic and manual protein classifications. Achieving full agreement between these two approaches would apparently require additional information.