Qiao Yanhua, Zhu Xiaolei, Gong Haipeng
School of Life Sciences, Tsinghua University, Beijing 100084, China.
School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China.
Bioinformatics. 2022 Jan 12;38(3):648-654. doi: 10.1093/bioinformatics/btab712.
As one of the most important post-translational modifications (PTMs), protein lysine crotonylation (Kcr) has attracted wide attention; it is involved in important physiological activities such as cell differentiation and metabolism. However, experimental methods for Kcr identification are expensive and time-consuming. Computational methods, by contrast, can predict Kcr sites in silico with high efficiency and low cost.
In this study, we proposed a novel predictor, BERT-Kcr, for protein Kcr site prediction, developed via transfer learning with pre-trained bidirectional encoder representations from transformers (BERT) models. These models were originally designed for natural language processing (NLP) tasks such as sentence classification. Here, we treated each amino acid as a word and fed the resulting sequence to the pre-trained BERT model. The features encoded by BERT were extracted and then passed to a BiLSTM network to build our final model. Compared with models built on other machine learning and deep learning classifiers, BERT-Kcr achieved the best performance, with an AUROC of 0.983 in 10-fold cross-validation. Further evaluation on the independent test set indicated that BERT-Kcr outperforms the state-of-the-art model Deep-Kcr, improving AUROC by about 5%. These results indicate that the direct use of sequence information together with advanced pre-trained NLP models can be an effective way to identify PTM sites of proteins.
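The input construction described above can be sketched as follows. This is an illustrative example, not the authors' code: the window half-width, the padding character, and the helper names are assumptions, and only the preprocessing step (each residue becomes a space-separated "word" for a BERT-style tokenizer) is shown.

```python
# Hypothetical sketch of preparing a lysine-centered sequence window as
# "words" for a BERT-style tokenizer, as described in the abstract.
# Window size and the "X" padding token are assumptions for illustration.

def make_window(sequence, center, half_width=15, pad="X"):
    """Extract a fixed-length window centered on a candidate lysine,
    padding with `pad` when the window extends past either terminus."""
    left = sequence[max(0, center - half_width):center]
    right = sequence[center + 1:center + 1 + half_width]
    left = pad * (half_width - len(left)) + left
    right = right + pad * (half_width - len(right))
    return left + sequence[center] + right

def to_bert_input(window):
    """Treat each residue as a word: space-separated tokens for BERT."""
    return " ".join(window)

# Toy sequence with a candidate Kcr site (the second lysine).
seq = "MKVLAAGKTRSPKEQAFL"
idx = seq.index("K", 5)               # position of the candidate lysine
window = make_window(seq, idx, half_width=5)
print(to_bert_input(window))          # e.g. "V L A A G K T R S P K"
```

In the full pipeline, such token strings would be encoded by the pre-trained BERT model and the per-token embeddings fed to a BiLSTM classifier; that downstream model is omitted here.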
The BERT-Kcr model is publicly available on http://zhulab.org.cn/BERT-Kcr_models/.
Supplementary data are available at Bioinformatics online.