Department of Biomedical Sciences, University of Padua, Padua, Italy.
PLoS Comput Biol. 2020 Jun 22;16(6):e1007967. doi: 10.1371/journal.pcbi.1007967. eCollection 2020 Jun.
Post-translational modification (PTM) sites have become popular targets for predictor development. However, with the exception of phosphorylation and a handful of other examples, PTMs suffer from a limited number of available training examples and sparsity in protein sequences. Here, proline hydroxylation is taken as an example to compare different methods and evaluate their performance on new, experimentally determined sites. To guide experimental design effectively, predictors require both high specificity and high sensitivity. However, self-reported performance is often not indicative of prediction quality, and detection of new sites is not guaranteed. We have benchmarked seven published hydroxylation site predictors on two newly constructed independent datasets. The self-reported performance is found to substantially overestimate the real accuracy measured on independent datasets. No predictor performs better than random on new examples, indicating that the refined models do not generalize sufficiently to detect new sites. The number of false positives is high and precision is low, in particular for non-collagen proteins, whose motifs are not conserved. As hydroxylation site predictors do not generalize to new data, caution is advised when using PTM predictors in the absence of independent evaluations, in particular for highly specific sites involved in signalling.
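The relationship between sensitivity, specificity and precision on sparse sites can be illustrated with a minimal Python sketch. The function name `confusion_metrics` and all counts below are hypothetical and chosen only for illustration; they are not taken from the benchmark. The sketch assumes 20 true hydroxylation sites among 1000 candidate prolines and shows how a predictor with seemingly good sensitivity and specificity can still yield mostly false positives.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity and precision from confusion counts."""
    # Guard each ratio against an empty denominator.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # fraction of true sites found
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # fraction of non-sites rejected
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # fraction of predictions that are real
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
    }


# Hypothetical counts (not from the paper): 20 modified prolines out of 1000.
m = confusion_metrics(tp=16, fp=98, tn=882, fn=4)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, sensitivity is 0.8 and specificity is 0.9, yet precision is only about 0.14, because the 98 false positives drawn from the large pool of unmodified prolines outnumber the 16 true positives.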