Poretsky Elly, Andorf Carson M, Sen Taner Z
Agricultural Research Service, Crop Improvement and Genetics Research Unit, U.S. Department of Agriculture, Albany, CA, United States.
Agricultural Research Service, Corn Insects and Crop Genetics Research, U.S. Department of Agriculture, Ames, IA, United States.
Plant Direct. 2023 Dec 20;7(12):e554. doi: 10.1002/pld3.554. eCollection 2023 Dec.
Protein phosphorylation is a dynamic and reversible post-translational modification that regulates a variety of essential biological processes. The regulatory role of phosphorylation in cellular signaling pathways, protein-protein interactions, and enzymatic activities has motivated extensive research efforts to understand its functional implications. Experimental protein phosphorylation data in plants remain limited to a few species, necessitating a scalable and accurate prediction method. Here, we present PhosBoost, a machine-learning approach that leverages protein language models and gradient-boosting trees to predict protein phosphorylation from experimentally derived data. Using training data obtained from a comprehensive plant phosphorylation database, qPTMplants, we compared the performance of PhosBoost to that of two existing protein phosphorylation prediction methods, PhosphoLingo and DeepPhos. For serine and threonine prediction, PhosBoost achieved higher recall than PhosphoLingo and DeepPhos (.78, .56, and .14, respectively) while maintaining a competitive area under the precision-recall curve (.54, .56, and .42, respectively). PhosphoLingo and DeepPhos failed to predict any tyrosine phosphorylation sites, while PhosBoost achieved a recall of .6. Despite the precision-recall tradeoff, PhosBoost offers improved performance when recall is prioritized while consistently providing more confident probability scores. A sequence-based pairwise alignment step improved prediction results for all classifiers by effectively increasing the number of inferred positive phosphosites. We provide evidence that PhosBoost models are transferable across species and scalable to genome-wide protein phosphorylation prediction. PhosBoost is freely and publicly available on GitHub.
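The two metrics reported above, recall and area under the precision-recall curve, can be illustrated with a minimal sketch. This is not the PhosBoost code: the function names, the 0.5 decision threshold, the toy labels and scores, and the use of average precision as the estimate of the area under the curve are all assumptions made for illustration.

```python
# Illustrative sketch only; not taken from PhosBoost.
# Assumptions: labels are 1 (phosphosite) or 0 (non-phosphosite),
# scores are predicted probabilities, and the threshold is 0.5.

def recall(labels, scores, threshold=0.5):
    """Fraction of true phosphosites scored at or above the threshold."""
    positives = sum(labels)
    if positives == 0:
        return 0.0
    true_positives = sum(
        1 for y, s in zip(labels, scores) if y == 1 and s >= threshold
    )
    return true_positives / positives


def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    Sites are ranked by descending score, and the precision observed at
    each true phosphosite is averaged. Tied scores are not treated
    specially in this sketch.
    """
    positives = sum(labels)
    if positives == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / positives


# Hypothetical toy data: six candidate sites, three of them true phosphosites.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

toy_recall = recall(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

On the toy data, lowering the threshold from 0.5 to 0.4 recovers the third true site and raises recall, at the cost of admitting lower-scoring predictions. This is the precision-recall tradeoff the abstract refers to.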