McGuffin L J, Bryson K, Jones D T
Bioinformatics Group, Department of Biological Sciences, Brunel University, Uxbridge UB8 3PH, UK.
Bioinformatics. 2001 Jan;17(1):63-72. doi: 10.1093/bioinformatics/17.1.63.
MOTIVATION: What constitutes a baseline level of success for protein fold recognition methods? Fold recognition benchmarks are often presented without any thought to the results that might be expected from a purely random set of predictions, so an analysis of fold recognition baselines is long overdue. Given varying amounts of basic information about a protein, ranging from the length of its sequence to knowledge of its secondary structure, to what extent can the fold be determined by intelligent guesswork? Can simple methods that make use of secondary structure information assign folds more accurately than purely random methods, and could these methods be used to construct viable hierarchical classifications?

EXPERIMENTS PERFORMED: A number of rapid automatic methods that score similarities between protein domains were devised and tested. These methods ranged from those that incorporated no secondary structure information, such as measuring absolute differences in sequence lengths, to more complex alignments of secondary structure elements. Each method was assessed for accuracy by comparison with the Class, Architecture, Topology, Homology (CATH) classification. Methods were rated against a random baseline fold assignment method as a lower control and against FSSP as an upper control. Similarity trees were constructed to evaluate how accurately the optimum methods produce a classification of structure.
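The two ends of the method range described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the one-letter-per-element encoding of secondary structure (`H` for helix, `E` for strand), and the match, mismatch and gap values are all assumptions, since the abstract does not specify them. The alignment shown is a plain Needleman-Wunsch global alignment, used here only as a stand-in for "alignment of secondary structure elements".

```python
def length_score(len_a: int, len_b: int) -> int:
    """Similarity from sequence length alone: 0 is best, more negative is worse."""
    return -abs(len_a - len_b)


def sse_alignment_score(a: str, b: str, match: int = 2,
                        mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score of two secondary structure element strings.

    Each character stands for one element (hypothetical encoding:
    'H' = helix, 'E' = strand). Scoring values are illustrative assumptions.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Under these assumed parameters, `sse_alignment_score("HEH", "HH")` aligns the two helices and pays one gap for the unmatched strand, whereas `length_score` sees only the difference in residue counts and ignores element order entirely.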
RESULTS: In a rigorous comparison of methods with CATH, the random fold assignment method set a lower baseline of 11% true positives at 3% false positives, and FSSP set an upper benchmark of 47% true positives at 3% false positives. The optimum secondary structure alignment method used here achieved 27% true positives at 3% false positives. Using a less rigorous Critical Assessment of Structure Prediction (CASP)-like sensitivity measurement, random assignment achieved 6%, FSSP 59%, and the optimum secondary structure alignment method 32%. Similarity trees produced by the optimum method illustrate that these methods cannot be used alone to produce a viable protein structural classification system.
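A figure such as "27% true positives at 3% false positives" can be read as the true positive rate reached when the score threshold is lowered until the false positive rate hits the allowed limit. The sketch below shows one way to compute that, assuming each domain pair carries a similarity score and a label saying whether the reference classification places both domains in the same fold. The function name and input layout are hypothetical, and the abstract does not describe the exact evaluation protocol used in the paper.

```python
from itertools import groupby


def tpr_at_fpr(scored_pairs, max_fpr=0.03):
    """Highest true positive rate reachable with false positive rate <= max_fpr.

    scored_pairs: iterable of (score, is_same_fold) tuples, where a higher
    score means the method considers the pair more similar.
    """
    pairs = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, same in pairs if same)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative pair")

    tp = fp = 0
    best = 0.0
    # Lower the threshold one distinct score at a time so tied pairs
    # are accepted or rejected together.
    for _, group in groupby(pairs, key=lambda p: p[0]):
        for _, same in group:
            if same:
                tp += 1
            else:
                fp += 1
        if fp / n_neg > max_fpr:
            break
        best = tp / n_pos
    return best
```

Running the same function over scores drawn at random gives the corresponding lower control, which is what makes the comparison between a method and the baseline meaningful.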
CONCLUSIONS: Simple methods that use perfect secondary structure information to assign folds cannot produce an accurate protein taxonomy; however, they do provide useful baselines for fold recognition. In terms of a typical CASP assessment, our results suggest that approximately 6% of targets with folds in the databases could be assigned correctly by random guessing, and as many as 32% could be recognised by trivial secondary structure comparison methods, given knowledge of their correct secondary structures.