Department of Computer Science, Informatics Institute, Christopher S. Bond Life Science Center, University of Missouri, Columbia, MO 65211, USA.
BMC Bioinformatics. 2014 Apr 28;15:120. doi: 10.1186/1471-2105-15-120.
It is important to predict the quality of a protein structural model before its native structure is known. Methods that can predict the absolute local quality of individual residues in a single protein model are rare, yet they are particularly needed for using, ranking and refining protein models.
We developed a machine learning tool (SMOQ) that predicts the distance deviation of each residue in a single protein model. To make predictions, SMOQ uses support vector machines (SVM) with protein sequence and structural features (the basic feature set): amino acid sequence, secondary structures, solvent accessibilities, and residue-residue contacts. We also trained an SVM model with two additional features (profiles and SOV scores) on 20 CASP8 targets and found that including them improves performance only when the real deviation between the native structure and the model is higher than 5 Å. The released version of SMOQ therefore uses the basic feature set and was trained on 85 CASP8 targets. SMOQ also implements a way to convert predicted local quality scores into a global quality score. SMOQ was tested on the 84 CASP9 single-domain targets. On the test data, the average difference between the residue-specific distance deviation predicted by our method and the actual distance deviation is 2.637 Å. The global quality prediction accuracy of the tool is comparable to that of other well-performing tools on the same benchmark.
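The two evaluation quantities mentioned above can be illustrated with a minimal Python sketch. It is not SMOQ's code. The function names are hypothetical, the per-residue deviations are made-up example values, and the local-to-global conversion shown (an S-score style transform averaged over residues, with an assumed scaling distance `d0` of 5 Å) is only one common choice; the abstract does not state the formula SMOQ uses.

```python
from statistics import mean


def mean_absolute_difference(predicted, actual):
    """Average absolute gap (in Å) between predicted and actual per-residue deviations."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return mean(abs(p - a) for p, a in zip(predicted, actual))


def global_score(deviations, d0=5.0):
    """Map per-residue deviations (Å) to one score in (0, 1].

    ASSUMPTION: S-score style transform 1 / (1 + (d / d0)^2), averaged over
    residues. Both the transform and d0 = 5.0 are illustrative choices, not
    confirmed details of SMOQ.
    """
    return mean(1.0 / (1.0 + (d / d0) ** 2) for d in deviations)


# Made-up example values for a four-residue model (not real data).
predicted = [1.0, 2.5, 5.0, 10.0]
actual = [1.5, 2.0, 7.0, 9.0]

mad = mean_absolute_difference(predicted, actual)
score = global_score(predicted)
print(f"mean absolute difference: {mad:.3f} Å")
print(f"global score: {score:.3f}")
```

Under this transform, a residue predicted to sit exactly on the native position contributes 1 and a residue with a deviation of `d0` contributes 0.5, so large local errors lower the global score without dominating it.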
SMOQ is a useful tool for protein single model quality assessment. Its source code and executable are available at: http://sysbio.rnet.missouri.edu/multicom_toolbox/.