Qiao Yanhua, Zhu Xiaolei, Gong Haipeng
School of Life Sciences, Tsinghua University, Beijing 100084, China.
School of Sciences, Anhui Agricultural University, Hefei, Anhui 230036, China.
Bioinformatics. 2022 Jan 12;38(3):648-654. doi: 10.1093/bioinformatics/btab712.
As one of the most important post-translational modifications (PTMs), protein lysine crotonylation (Kcr) has attracted wide attention; it is involved in important physiological activities such as cell differentiation and metabolism. However, experimental methods for Kcr identification are expensive and time-consuming. Computational methods, by contrast, can predict Kcr sites in silico with high efficiency and low cost.
In this study, we proposed a novel predictor, BERT-Kcr, for protein Kcr site prediction, developed via transfer learning with pre-trained bidirectional encoder representations from transformers (BERT) models. These models were originally designed for natural language processing (NLP) tasks such as sentence classification. Here, we treated each amino acid as a word and fed the resulting sequence to the pre-trained BERT model. The features encoded by BERT were extracted and then passed to a BiLSTM network to build our final model. Compared with models built on other machine learning and deep learning classifiers, BERT-Kcr achieved the best performance, with an AUROC of 0.983 in 10-fold cross-validation. Further evaluation on the independent test set indicated that BERT-Kcr outperforms the state-of-the-art model Deep-Kcr, improving AUROC by about 5%. These results indicate that the direct use of sequence information together with advanced pre-trained NLP models can be an effective way to identify PTM sites of proteins.
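The input construction described above can be sketched as follows. This is an illustrative example, not the authors' code: the window half-width, the padding character, and the helper names are assumptions, and only the preprocessing step (each residue becomes a space-separated "word" for a BERT-style tokenizer) is shown.

```python
# Hypothetical sketch of preparing a lysine-centered sequence window as
# "words" for a BERT-style tokenizer, as described in the abstract.
# Window size and the "X" padding token are assumptions for illustration.

def make_window(sequence, center, half_width=15, pad="X"):
    """Extract a fixed-length window centered on a candidate lysine,
    padding with `pad` when the window extends past either terminus."""
    left = sequence[max(0, center - half_width):center]
    right = sequence[center + 1:center + 1 + half_width]
    left = pad * (half_width - len(left)) + left
    right = right + pad * (half_width - len(right))
    return left + sequence[center] + right

def to_bert_input(window):
    """Treat each residue as a word: space-separated tokens for BERT."""
    return " ".join(window)

# Toy sequence with a candidate Kcr site (the second lysine).
seq = "MKVLAAGKTRSPKEQAFL"
idx = seq.index("K", 5)               # position of the candidate lysine
window = make_window(seq, idx, half_width=5)
print(to_bert_input(window))          # e.g. "V L A A G K T R S P K"
```

In the full pipeline, such token strings would be encoded by the pre-trained BERT model and the per-token embeddings fed to a BiLSTM classifier; that downstream model is omitted here.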
The BERT-Kcr model is publicly available on http://zhulab.org.cn/BERT-Kcr_models/.
Supplementary data are available at Bioinformatics online.