School of Computer Science and Technology, Soochow University, No. 1 Shizi Street, Suzhou 215006, China.
Department of Experimental Medical Science, BMC B13, Lund University, SE-22 184 Lund, Sweden.
Int J Mol Sci. 2018 Mar 28;19(4):1009. doi: 10.3390/ijms19041009.
Several methods have been developed to predict the effects of amino acid substitutions on protein stability. Benchmark datasets are essential for training and testing such methods and must meet numerous requirements, including that the data be representative of the phenomenon under investigation. All available machine learning predictors of variant stability have been trained with ProTherm data. We noticed a number of issues with the contents, quality and relevance of the database: there were errors, but also features that had not been clearly communicated. Consequently, all machine learning variant stability predictors have been trained on biased and incorrect data. We obtained a corrected dataset and trained a random forest-based tool, PON-tstab, which is applicable to variants in any organism. Predictions are provided for three categories: stability decreasing, stability increasing and not affecting stability. Our results highlight the importance of benchmark quality, suitability and appropriateness.