College of Computer Science and Technology, Jilin University, Jilin, Changchun 130012, China.
Curr Protein Pept Sci. 2011 Sep;12(6):540-8. doi: 10.2174/138920311796957658.
One of the major challenges in protein tertiary structure prediction is structure quality assessment. In many cases, protein structure prediction tools generate good structural models but fail to select the best ones from a huge number of candidates as the final output. In this study, we developed a sampling-based machine-learning method to rank protein structural models by integrating multiple scores and features. First, features such as predicted secondary structure, solvent accessibility, and residue-residue contact information are integrated by two Radial Basis Function (RBF) models trained on different datasets. Then, the two RBF scores and five selected scoring functions developed by others, namely Opus-CA, Opus-PSP, DFIRE, RAPDF, and Cheng Score, are synthesized by a sampling method. Finally, another integrated RBF model ranks the structural models according to the features of the sampling distribution. We tested the proposed method on two different datasets: the CASP server prediction models of all CASP8 targets, and a set of models generated by our in-house software MUFOLD. The test results show that our method outperforms every individual scoring function in both best-model selection and the overall correlation between the predicted ranking and the actual ranking of structural quality.
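The two evaluation criteria named above, best-model selection and the correlation between predicted and actual rankings, can be illustrated with a minimal sketch. It is not the authors' evaluation code. The function names (`ranks`, `spearman`, `top1_loss`), the toy numbers, and the convention that a higher score means a better model are all assumptions made for illustration; energy-like scores, where lower is better, would need to be negated first.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    return pearson(ranks(predicted), ranks(actual))


def top1_loss(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Quality gap between the truly best model and the top-ranked model."""
    picked = max(range(len(predicted)), key=lambda i: predicted[i])
    return max(actual) - actual[picked]


# Hypothetical scores for four candidate models of one target (toy data).
predicted = [0.62, 0.80, 0.55, 0.71]  # scores from some ranking method
quality = [0.58, 0.66, 0.40, 0.70]    # true structural quality of each model

rho = spearman(predicted, quality)
loss = top1_loss(predicted, quality)
```

A method that ranks well gives a rank correlation close to 1 and a top-1 loss close to 0; comparing these two numbers across scoring functions is one way to read the claim that the combined method outperforms each individual score.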