Predictive Recognition of DNA-binding Proteins Based on Pre-trained Language Model BERT

Authors

Ma Yue, Pei Yongzhen, Li Changguo

Affiliations

School of Computer Science and Technology, Tiangong University, Tianjin, P. R. China.

School of Mathematical Sciences, Tiangong University, Tianjin, P. R. China.

Publication information

J Bioinform Comput Biol. 2023 Dec;21(6):2350028. doi: 10.1142/S0219720023500282. Epub 2024 Jan 23.

DOI:10.1142/S0219720023500282

PMID:38248912

Abstract

Identifying proteins is crucial for disease diagnosis and treatment. As the number of known proteins grows, large-scale batch prediction becomes essential. However, traditional biological experiments are time-consuming and expensive, which makes it difficult to accomplish this task efficiently. Deep learning algorithms based on big data analysis have shown potential in this respect. In recent years, language representation models, especially BERT, have made significant advances in natural language processing. In this paper, nine BERT models of different sizes are constructed by combining three protein segmentation methods with three encoder counts, in order to predict whether known proteins are DNA-binding proteins. Furthermore, based on the concept of protein motifs, multi-scale convolutional networks are fused into the models to extract the local features of DNA-binding proteins. We find that, when each amino acid in the protein is treated as a word, the more encoders the model has, the better its predictions. The proposed algorithm achieves 81.88% sensitivity and a Matthews correlation coefficient (MCC) of 0.39 on the test set, and 62.41% accuracy on the independent test set PDB2272. These results suggest that the proposed method can serve as a tool to assist in the identification of DNA-binding proteins.
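As an illustration only, the sketch below shows what "treating each amino acid as a word" can look like in code, along with the standard definitions of the two reported metrics, sensitivity and MCC. The abstract does not name the other two segmentation methods, so the k-mer splitting here is an assumption, and the function names (`segment`, `sensitivity`, `mcc`) and the example sequence are hypothetical rather than taken from the paper.

```python
from math import sqrt


def segment(seq, k=1, overlap=True):
    """Split a protein sequence into k-residue 'words'.

    k=1 treats every amino acid as one word, as described in the abstract.
    k>1 (overlapping or not) is an ASSUMED alternative segmentation; the
    abstract does not specify the other methods. Trailing residues that do
    not fill a complete word are dropped.
    """
    step = 1 if overlap else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def sensitivity(tp, fn):
    """Fraction of true DNA-binding proteins that are predicted as binding."""
    return tp / (tp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    toy = "MKVLAG"  # hypothetical sequence, not from the paper's data
    print(segment(toy))                      # one word per residue
    print(segment(toy, k=3))                 # overlapping 3-mers
    print(segment(toy, k=3, overlap=False))  # non-overlapping 3-mers
```

MCC ranges from -1 to 1 and accounts for all four confusion-matrix cells, which is why it is often reported alongside sensitivity when a classifier may favor one class.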
