School of Computing, University of Georgia, GA 30602, USA.
Institute of Bioinformatics, University of Georgia, GA 30602, USA.
Bioinformatics. 2023 Feb 3;39(2). doi: 10.1093/bioinformatics/btad046.
The human genome encodes over 500 distinct protein kinases that regulate nearly all cellular processes through the specific phosphorylation of protein substrates. While advances in mass spectrometry and proteomics have identified thousands of phosphorylation sites across species, the specific kinases that phosphorylate these sites remain unknown for the vast majority of phosphosites. Recently, there has been a major focus on developing computational models for predicting kinase-substrate associations. However, most current models only allow predictions for a subset of well-studied kinases. Furthermore, reliance on hand-curated features and imbalances in training and testing datasets pose challenges for developing accurate models of kinase-specific phosphorylation. Motivated by the recent development of universal protein language models, which automatically generate context-aware features from primary sequence information, we sought to develop a unified framework for kinase-specific phosphosite prediction, allowing for greater investigative utility and enabling substrate predictions at the whole-kinome level.
We present Phosformer, a deep learning model for kinase-specific phosphosite prediction that predicts the probability of phosphorylation given an arbitrary pair of unaligned kinase and substrate peptide sequences. We demonstrate that Phosformer implicitly learns evolutionary and functional features during training, removing the need for feature curation and engineering. Further analyses reveal that Phosformer also learns substrate specificity motifs and can distinguish between functionally distinct kinase families. Benchmarks indicate that Phosformer exhibits significant improvements over state-of-the-art models, while also providing a more generalized, unified, and interpretable predictive framework.
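To make the input format concrete, the following is a minimal sketch of how candidate (kinase, substrate peptide) pairs might be assembled before being scored by a pairwise model of this kind. It is not taken from the Phosformer code base: the function names `candidate_peptides` and `build_pairs`, the flank width of 5 residues, and the `-` padding character are all illustrative assumptions. The sketch only prepares inputs and does not compute a phosphorylation probability.

```python
from typing import Iterator, List, Tuple

# Residues that can accept a phosphate group: serine, threonine, tyrosine.
PHOSPHO_ACCEPTORS = frozenset("STY")


def candidate_peptides(
    substrate: str, flank: int = 5, pad: str = "-"
) -> Iterator[Tuple[int, str]]:
    """Yield (1-based position, peptide window) for each S/T/Y in `substrate`.

    The window width (2 * flank + 1) and the padding character are
    assumptions made for this sketch, not values from the paper.
    """
    # Pad both ends so sites near the termini still get a full-width window.
    padded = pad * flank + substrate + pad * flank
    for i, residue in enumerate(substrate):
        if residue in PHOSPHO_ACCEPTORS:
            # Residue i sits at index i + flank in the padded string,
            # so this slice is centered on it.
            yield i + 1, padded[i : i + 2 * flank + 1]


def build_pairs(
    kinase: str, substrate: str, flank: int = 5
) -> List[Tuple[str, int, str]]:
    """Pair one unaligned kinase sequence with every candidate peptide."""
    return [
        (kinase, position, peptide)
        for position, peptide in candidate_peptides(substrate, flank)
    ]
```

Each resulting tuple holds a kinase sequence, the site position in the substrate, and the peptide window. A pairwise predictor would take the kinase and peptide strings as its two inputs and return one probability per tuple.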
Code and data are available at https://github.com/esbgkannan/phosformer.
Supplementary data are available at Bioinformatics online.