A Pretrained ELECTRA Model for Kinase-Specific Phosphorylation Site Prediction.

Affiliation

Department of Electrical Engineering and Computer Science and Christopher S. Bond Life Sciences Center, University of Missouri, Columbia, MO, USA.

Publication information

Methods Mol Biol. 2022;2499:105-124. doi: 10.1007/978-1-0716-2317-6_4.

PMID:35696076

Abstract

Phosphorylation plays a vital role in signal transduction and the cell cycle. Machine-learning methods have long been used to identify and understand phosphorylation. However, existing methods learn representations of a protein sequence segment only from the labeled dataset itself, which can result in biased or incomplete features, especially for kinase-specific phosphorylation site prediction, in which training data are typically sparse. To learn a comprehensive contextual representation of a protein sequence segment for kinase-specific phosphorylation site prediction, we pretrained our model on over 24 million unlabeled sequence fragments using ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). The pretrained model was applied to kinase-specific site prediction for the kinases CDK, PKA, CK2, MAPK, and PKC. On the benchmark data, the pretrained ELECTRA model achieves a 9.02% improvement over BERT and an 11.10% improvement over MusiteDeep in the area under the precision-recall curve.
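
To make the pretraining idea concrete, the following is a minimal sketch of how unlabeled sequence fragments could be turned into replaced-token-detection examples, the objective that ELECTRA's discriminator is trained on. It is not the authors' pipeline. The function names, the window half-width, the replacement probability, and the padding symbol are all hypothetical choices for illustration, and a uniform random substitution stands in for ELECTRA's learned generator network.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAD = "-"                             # assumed padding symbol for sequence ends


def extract_fragments(sequence, half_width=7, targets="STY"):
    """Return (position, fragment) pairs centred on each S/T/Y residue.

    half_width is an illustrative assumption, not a value from the paper.
    """
    padded = PAD * half_width + sequence + PAD * half_width
    fragments = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # Index i in the raw sequence is index i + half_width in padded,
            # so this slice places the candidate residue in the centre.
            fragments.append((i, padded[i:i + 2 * half_width + 1]))
    return fragments


def corrupt_fragment(fragment, replace_prob=0.15, rng=None):
    """Build one replaced-token-detection example from a fragment.

    Each non-padding residue is swapped with probability replace_prob.
    A random draw is used here in place of ELECTRA's generator (assumption).
    Labels are 1 where the token differs from the original, 0 otherwise,
    so a substitution that happens to reproduce the original counts as 0.
    """
    if rng is None:
        rng = random.Random(0)
    corrupted, labels = [], []
    for token in fragment:
        if token != PAD and rng.random() < replace_prob:
            new_token = rng.choice(AMINO_ACIDS)
            corrupted.append(new_token)
            labels.append(int(new_token != token))
        else:
            corrupted.append(token)
            labels.append(0)
    return "".join(corrupted), labels
```

A discriminator pretrained to predict these per-residue labels sees every position of every fragment as a training signal, which is why this kind of objective can exploit large unlabeled collections before fine-tuning on the sparse kinase-specific labels.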
