School of Computing, University of Georgia, Athens, GA 30602, United States.
Institute of Bioinformatics, University of Georgia, Athens, GA 30602, United States.
Bioinformatics. 2024 Feb 1;40(2). doi: 10.1093/bioinformatics/btae033.
Phosphorylation, a post-translational modification catalyzed by protein kinases, plays an essential role in almost all cellular processes. Understanding how each of the nearly 500 human protein kinases selectively phosphorylates its substrates is a foundational challenge in bioinformatics and cell signaling. Although deep learning models have been a popular means to predict kinase-substrate relationships, existing models often lack interpretability and are trained on datasets skewed toward a subset of well-studied kinases.
Here we leverage recent peptide library datasets, generated to determine the substrate specificity profiles of 300 serine/threonine kinases, to develop an explainable Transformer model for kinase-peptide interaction prediction. The model, trained solely on primary sequences, achieved state-of-the-art performance. A multitask learning paradigm built into the model enables predictions on virtually any kinase-peptide pair, including predictions on 139 kinases not used in peptide library screens. Furthermore, we employed explainable machine learning methods to elucidate the model's inner workings. Through analysis of learned embeddings at different training stages, we demonstrate that the model employs a unique strategy of substrate prediction that considers both substrate motif patterns and kinase evolutionary features. SHapley Additive exPlanation (SHAP) analysis reveals key specificity-determining residues in the peptide sequence. Finally, we provide a web interface for predicting kinase-substrate associations for user-defined sequences and a resource for visualizing the learned kinase-substrate associations.
All code and data are available at https://github.com/esbgkannan/Phosformer-ST. The web server is available at https://phosformer.netlify.app.
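
To illustrate the idea behind the SHAP analysis mentioned above, the following is a minimal sketch of how Shapley values attribute a prediction score to individual peptide residues. It is not the authors' code and does not call the Phosformer-ST model: `toy_score`, `mask`, and `shapley_values` are hypothetical names, and the scoring rule (an arginine three residues before a central S/T, a proline immediately after it) is an invented stand-in for a trained predictor. The sketch assumes a 7-residue peptide and enumerates every subset of positions exactly, which is only feasible for very short sequences; the SHAP library approximates these values for real models.

```python
from itertools import combinations
from math import factorial


def toy_score(peptide: str) -> float:
    """Hypothetical scorer (an assumption, NOT the Phosformer-ST model).

    Rewards a central S/T phospho-acceptor, an R three residues before it,
    and a P immediately after it. Assumes a peptide of length 7.
    """
    center = len(peptide) // 2
    score = 0.0
    if peptide[center] in "ST":
        score += 0.2
        if peptide[center - 3] == "R":
            score += 0.5
        if peptide[center + 1] == "P":
            score += 0.3
    return score


def mask(peptide: str, keep: set, mask_char: str = "X") -> str:
    """Replace every position not in `keep` with a masking character."""
    return "".join(aa if i in keep else mask_char for i, aa in enumerate(peptide))


def shapley_values(peptide: str, score_fn) -> list:
    """Exact Shapley value of each position, treating residues as players."""
    n = len(peptide)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                kept = set(subset)
                without_i = score_fn(mask(peptide, kept))
                with_i = score_fn(mask(peptide, kept | {i}))
                values[i] += weight * (with_i - without_i)
    return values


if __name__ == "__main__":
    peptide = "RAASPAA"  # illustrative sequence, not taken from the paper
    for pos, (aa, val) in enumerate(zip(peptide, shapley_values(peptide, toy_score))):
        print(pos, aa, round(val, 3))
```

In this toy setting the attributions sum to the difference between the score of the full peptide and the score of the fully masked peptide, and positions the scorer never inspects receive zero. This is the property that makes per-residue Shapley attributions readable as specificity determinants.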