Borreguero Jose M, Skolnick Jeffrey
Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30318, USA.
Proteins. 2007 Jul 1;68(1):48-56. doi: 10.1002/prot.21392.
A significant number of protein sequences in a given proteome have no obvious evolutionarily related protein in the database of solved protein structures, the PDB. Under these conditions, ab initio or template-free modeling methods are the sole means of predicting protein structure. To assess its expected performance on proteomes, the TASSER structure prediction algorithm is benchmarked in the ab initio limit on a representative set of 1129 nonhomologous sequences ranging from 40 to 200 residues that cover the PDB at 30% sequence identity and that adopt alpha, alpha + beta, and beta secondary structures. For sequences in the 40-100 (100-200) residue range, as assessed by their root mean square deviation from native, RMSD, the best of the top five ranked models of TASSER has a global fold that is significantly close to the native structure for 25% (16%) of the sequences, and correctly identifies the structure of the protein core for 59% (36%). In the absence of a native structure, the structural similarity among the top five ranked models is a moderately reliable predictor of folding accuracy. If we classify the sequences according to their secondary structure content, then 64% (36%) of alpha, 43% (24%) of alpha + beta, and 20% (12%) of beta sequences in the 40-100 (100-200) residue range have a significant TM-score (TM-score ≥ 0.4). TASSER performs best on helical proteins because there are fewer secondary structural elements to arrange in a helical protein than in a beta protein of equal length, since the average length of a helix is longer than that of a strand. In addition, helical proteins have shorter loops and dangling tails. If we exclude these flexible fragments, then TASSER has similar accuracy for sequences containing the same number of secondary structural elements, irrespective of whether they are helices and/or strands. Thus, it is the effective configurational entropy of the protein that dictates the average likelihood of correctly arranging the secondary structure elements.
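The two accuracy measures quoted in the abstract, RMSD and TM-score, can be illustrated with a minimal Python sketch. It assumes the model and native C-alpha coordinates are already matched residue by residue and already superposed; the full TM-score additionally searches for the superposition that maximizes the sum, a step omitted here. The function names and the 0.5 Å floor on d0 are illustrative assumptions and are not taken from TASSER; only the 0.4 cutoff comes from the abstract.

```python
import math

# Significance threshold quoted in the abstract (TM-score >= 0.4).
TM_SIGNIFICANT = 0.4


def tm_d0(length):
    """Length-dependent distance scale d0 (in Angstrom) of the TM-score.

    Standard definition: d0 = 1.24 * (L - 15)^(1/3) - 1.8.
    The 0.5 floor for very short chains is an assumption of this sketch.
    """
    if length <= 15:
        return 0.5
    return max(0.5, 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8)


def _distances(model, native):
    """Per-residue distances between two equal-length coordinate lists."""
    if len(model) != len(native) or not native:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    return [math.dist(a, b) for a, b in zip(model, native)]


def rmsd_superposed(model, native):
    """RMSD for coordinates that are assumed to be superposed already."""
    d = _distances(model, native)
    return math.sqrt(sum(x * x for x in d) / len(d))


def tm_score_superposed(model, native):
    """TM-score sum for one fixed superposition, normalized by native length.

    This is a lower bound on the true TM-score, which is the maximum
    of this quantity over all rigid-body superpositions.
    """
    d = _distances(model, native)
    d0 = tm_d0(len(native))
    return sum(1.0 / (1.0 + (x / d0) ** 2) for x in d) / len(native)


def is_significant(tm_score):
    """Apply the abstract's TM-score >= 0.4 significance cutoff."""
    return tm_score >= TM_SIGNIFICANT


if __name__ == "__main__":
    # Toy 100-residue chain along the x axis, 3.8 Angstrom between residues.
    native = [(3.8 * i, 0.0, 0.0) for i in range(100)]
    # Hypothetical model: every residue displaced by 2 Angstrom in y.
    model = [(x, y + 2.0, z) for x, y, z in native]
    score = tm_score_superposed(model, native)
    print(rmsd_superposed(model, native), score, is_significant(score))
```

Because each residue contributes 1 / (1 + (d/d0)^2), large deviations in loops or dangling tails lower the TM-score only slightly, whereas they can dominate the RMSD. This is one reason the two measures can rank the same model differently.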