Eramian David, Eswar Narayanan, Shen Min-Yi, Sali Andrej
Graduate Group in Biophysics, University of California at San Francisco, California 94158, USA.
Protein Sci. 2008 Nov;17(11):1881-93. doi: 10.1110/ps.036061.108. Epub 2008 Oct 1.
Comparative structure models are available for two orders of magnitude more protein sequences than are experimentally determined structures. These models, however, suffer from two limitations that experimentally determined structures do not: they frequently contain significant errors, and their accuracy cannot be readily assessed. We have addressed the latter limitation by developing a protocol optimized specifically for predicting the Cα root-mean-squared deviation (RMSD) and native overlap (NO3.5Å) errors of a model in the absence of its native structure. In contrast to most traditional assessment scores, which merely predict that one model is more accurate than others, this approach quantifies the error in an absolute sense, thus helping to determine whether the model is suitable for its intended applications. The assessment relies on a model-specific scoring function constructed by a support vector machine. This regression optimizes the weights of up to nine features, including various sequence similarity measures and statistical potentials, extracted from a tailored training set of models unique to the model being assessed: if possible, we use similarly sized models with the same fold; otherwise, we use similarly sized models with the same secondary structure composition. For a diverse set of 580,317 comparative models of 6174 sequences, this protocol predicts the RMSD and NO3.5Å errors with correlation coefficients (r) of 0.84 and 0.86, respectively, to the actual errors. This is the best correlation among the assessment criteria tested; the 13 other criteria achieved correlations ranging from 0.35 to 0.71.
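The two error measures named in the abstract, and the correlation used to judge the predictions, can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the model and native Cα coordinates are already rigidly superposed and paired residue by residue, and that native overlap is the fraction of Cα atoms lying within a 3.5 Å cutoff of their native positions. The function names `ca_rmsd`, `native_overlap`, and `pearson_r` are hypothetical, chosen only for this illustration.

```python
import math


def ca_rmsd(model, native):
    """Root-mean-squared deviation over paired C-alpha coordinates.

    Assumes both coordinate lists are already superposed and aligned
    residue by residue (an assumption of this sketch).
    """
    if len(model) != len(native) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(m, n) ** 2 for m, n in zip(model, native))
    return math.sqrt(squared / len(model))


def native_overlap(model, native, cutoff=3.5):
    """Fraction of C-alpha atoms within `cutoff` angstroms of the native position."""
    if len(model) != len(native) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    close = sum(1 for m, n in zip(model, native) if math.dist(m, n) <= cutoff)
    return close / len(model)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between predicted and actual errors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy coordinates (angstroms), invented for illustration only.
    native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    model = [(0.0, 0.0, 0.0), (3.8, 1.0, 0.0), (7.6, 0.0, 2.0), (11.4, 0.0, 5.0)]
    print(ca_rmsd(model, native))         # sqrt(7.5), about 2.74
    print(native_overlap(model, native))  # 0.75 (three of four atoms within 3.5)
```

In the setting the abstract describes, `pearson_r` would compare the errors predicted by the scoring function with the errors measured against the native structures. The coordinates above are toy values and carry no experimental meaning.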