Li Guobin, Du Xiuquan, Li Xinlu, Zou Le, Zhang Guanhong, Wu Zhize
School of Artificial Intelligence and Big Data, Hefei University, Hefei, China.
School of Computer Science and Technology, Anhui University, Hefei, China.
PeerJ. 2021 May 3;9:e11262. doi: 10.7717/peerj.11262. eCollection 2021.
DNA-binding proteins (DBPs) play pivotal roles in many biological functions such as alternative splicing, RNA editing, and methylation. Many traditional machine learning (ML) and deep learning (DL) methods have been proposed to predict DBPs. However, these methods either rely on manual feature extraction or fail to capture long-term dependencies in the DNA sequence. In this paper, we propose a method called PDBP-Fusion that identifies DBPs from primary sequences alone by fusing local features with long-term dependencies. We use a convolutional neural network (CNN) to learn local features and a bidirectional long short-term memory network (Bi-LSTM) to capture critical long-term dependencies in context. In addition, feature extraction, model training, and model prediction are performed simultaneously. PDBP-Fusion predicts DBPs with 86.45% sensitivity, 79.13% specificity, 82.81% accuracy, and an MCC of 0.661 on the PDB14189 benchmark dataset. The MCC of the proposed method is at least 9.1% higher than that of other advanced prediction models. PDBP-Fusion also achieves superior performance and model robustness on the PDB2272 independent dataset. These results indicate that PDBP-Fusion can predict DBPs from sequences accurately and effectively. An online server is available at http://119.45.144.26:8080/PDBP-Fusion/.
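The four figures reported above (sensitivity, specificity, accuracy, and MCC) are standard binary-classification metrics derived from a confusion matrix. The following is a minimal sketch of how they are conventionally computed, not the authors' evaluation code. The function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn: DBPs predicted as DBP / non-DBP.
    tn/fp: non-DBPs predicted as non-DBP / DBP.
    Hypothetical helper for illustration; not taken from the paper.
    """
    total = tp + fn + tn + fp

    # Sensitivity (recall): fraction of true DBPs that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-DBPs that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: fraction of all predictions that were correct.
    accuracy = (tp + tn) / total if total else 0.0

    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, used only to exercise the formulas.
    for name, value in binary_metrics(tp=90, fn=10, tn=80, fp=20).items():
        print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four confusion-matrix cells and ranges from -1 to 1, which is one reason it is commonly reported alongside sensitivity and specificity for benchmarks of this kind.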