Sam Vichetra, Tai Chin-Hsien, Garnier Jean, Gibrat Jean-Francois, Lee Byungkook, Munson Peter J
Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, DHHS, Bethesda, MD, USA.
BMC Bioinformatics. 2006 Apr 13;7:206. doi: 10.1186/1471-2105-7-206.
Current classifications of protein folds are based, ultimately, on visual inspection of similarities. Previous attempts to use computerized structure comparison methods have shown only partial agreement with curated databases, and have not provided a detailed statistical and structural analysis of the causes of these divergences.
We construct a map of similarities/dissimilarities among manually defined protein folds, using a score cutoff value determined by means of the receiver operating characteristic (ROC) curve. The map identifies folds that appear to overlap or to be "confused" with each other according to two distinct similarity measures. It also identifies folds that appear inhomogeneous in that they contain apparently dissimilar domains, as measured by both similarity measures. At a low (1%) false positive rate, 25 to 38% of domain pairs in the same SCOP folds do not appear similar. Our results suggest either that some of these folds are defined using criteria other than purely structural considerations or that the similarity measures used do not recognize some relevant aspects of structural similarity in certain cases. Specifically, variations of the "common core" of some folds are severe enough to defeat attempts to automatically detect structural similarity and/or to lead to false detection of similarity between domains in distinct folds. Structures in some folds vary greatly in size because they contain varying numbers of a repeating unit, while similarity scores are quite sensitive to size differences. Structures in different folds may contain similar substructures, which produce false positives. Finally, the common core within a structure may be too small, relative to the entire structure, to be recognized as the basis of similarity to another.
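The cutoff selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `cutoff_at_fpr` and `missed_fraction` are hypothetical, the scores are invented toy values, and the sketch assumes that a higher score means greater similarity and that a pair is called similar only when its score is strictly above the cutoff.

```python
from typing import List


def cutoff_at_fpr(scores: List[float], same_fold: List[bool],
                  target_fpr: float = 0.01) -> float:
    """Pick a score cutoff so that the fraction of different-fold pairs
    scoring strictly above it does not exceed target_fpr.
    (Hypothetical helper, for illustration only.)"""
    # Scores of different-fold pairs ("negatives"), highest first.
    negatives = sorted(
        (s for s, same in zip(scores, same_fold) if not same), reverse=True
    )
    if not negatives:
        raise ValueError("need at least one different-fold pair")
    # Number of negatives allowed to score above the cutoff.
    allowed = int(target_fpr * len(negatives))
    if allowed >= len(negatives):
        return float("-inf")  # every pair may be called similar
    # At most `allowed` negatives are strictly greater than this value.
    return negatives[allowed]


def missed_fraction(scores: List[float], same_fold: List[bool],
                    cutoff: float) -> float:
    """Fraction of same-fold pairs that do NOT appear similar at the cutoff."""
    positives = [s for s, same in zip(scores, same_fold) if same]
    if not positives:
        raise ValueError("need at least one same-fold pair")
    return sum(1 for s in positives if s <= cutoff) / len(positives)


if __name__ == "__main__":
    # Invented toy data: 8 different-fold pairs, then 4 same-fold pairs.
    toy_scores = [1, 2, 3, 4, 5, 6, 7, 8, 5, 6.5, 9, 10]
    toy_labels = [False] * 8 + [True] * 4
    c = cutoff_at_fpr(toy_scores, toy_labels, target_fpr=0.25)
    print("cutoff:", c)
    print("same-fold pairs not detected:",
          missed_fraction(toy_scores, toy_labels, c))
```

In this framing, the 25 to 38% figure quoted above corresponds to the quantity returned by `missed_fraction` when the cutoff is chosen for a 1% false positive rate, computed once per similarity measure.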
A detailed analysis of the entire available protein fold space by two automated similarity methods reveals the extent and the nature of the divergence between the automatically determined similarity/dissimilarity and the manual fold type classifications. Some of the observed divergences can probably be addressed with better structure comparison methods and better automatic, intelligent classification procedures. Others may be intrinsic to the problem, suggesting a continuous rather than discrete protein fold space.