Department of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, Solna, Sweden.
Bioinformatics Short-term Support and Infrastructure (BILS), Science for Life Laboratory, Solna, Sweden.
Proteins. 2018 Jun;86(6):654-663. doi: 10.1002/prot.25492. Epub 2018 Apr 15.
Estimating the quality of protein models is an important part of protein structure prediction. For more than a decade we have developed a set of methods for this problem, using various descriptions of the protein and different machine learning methodologies. Common to all of these methods, however, has been the target function used for training. The target function in ProQ describes the local quality of a residue in a protein model, and in all versions of ProQ it has been the S-score. Other quality measures also exist, and they can be divided into superposition-based and contact-based methods. Superposition-based methods, such as the S-score, rely on a rigid-body superposition of the protein model onto the native structure, while contact-based methods compare the local environment of each residue. Here, we examine the effects of retraining our latest predictor, ProQ3D, using identical inputs but different target functions. We find that the contact-based measures are easier to predict and that predictors trained on them provide some advantages when it comes to identifying the best model. One possible reason is that contact-based methods are better at estimating the quality of multi-domain targets. However, training on the S-score gives the best correlation with the GDT_TS score, which is commonly used in CASP to score global model quality. To take advantage of both of these features, we provide an updated version of ProQ3D that predicts local and global model quality using several different quality measures.
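The distinction between superposition-based and contact-based targets can be illustrated with a minimal Python sketch. It is not the ProQ3D implementation. The S-score is written in its commonly cited form, S_i = 1 / (1 + (d_i / d0)^2), where d_i is the deviation of residue i after superposition; the threshold `d0 = 3.0` Å is an assumption, since the abstract does not give a value. The function `contact_agreement` and its `cutoff` of 8 Å are hypothetical and stand in for a generic contact-based measure, not for any specific published score.

```python
import math


def s_score(deviations, d0=3.0):
    """Per-residue S-score from deviations (in Å) measured after a
    rigid-body superposition of model onto native.
    d0 is an assumed distance threshold, not taken from the abstract."""
    return [1.0 / (1.0 + (d / d0) ** 2) for d in deviations]


def contact_agreement(model_coords, native_coords, cutoff=8.0):
    """Hypothetical contact-based local score: for each residue, the
    fraction of its native contacts (pairs closer than `cutoff`) that
    are also contacts in the model. No superposition is required."""
    n = len(native_coords)
    scores = []
    for i in range(n):
        native_contacts = [
            j for j in range(n)
            if j != i and math.dist(native_coords[i], native_coords[j]) < cutoff
        ]
        if not native_contacts:
            # Residue has no native contacts, so nothing can be violated.
            scores.append(1.0)
            continue
        kept = sum(
            1 for j in native_contacts
            if math.dist(model_coords[i], model_coords[j]) < cutoff
        )
        scores.append(kept / len(native_contacts))
    return scores


# Three residues on a line, 5 Å apart (toy coordinates).
native = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
# The same geometry translated far away and never superimposed.
shifted = [(x + 100.0, y, z) for x, y, z in native]

local_s = s_score([0.0, 3.0])                  # deviations of 0 Å and 3 Å
local_c = contact_agreement(shifted, native)   # unaffected by the shift
global_s = sum(local_s) / len(local_s)         # global score as the mean
```

Because the contact-based score depends only on distances within each structure, it is unchanged by how the model is placed relative to the native structure. This is one way to see why such measures may cope better with multi-domain targets, where a single rigid-body superposition cannot align all domains at once.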