Qu Yu-Hui, Yu Hua, Gong Xiu-Jun, Xu Jia-Hui, Lee Hong-Shun
School of Computer Science and Technology, Tianjin University, Nankai, Tianjin, China, 30072.
Tianjin Key Laboratory of Cognitive Computing and Application, Nankai, Tianjin, China, 30072.
PLoS One. 2017 Dec 29;12(12):e0188129. doi: 10.1371/journal.pone.0188129. eCollection 2017.
DNA-binding proteins play pivotal roles in alternative splicing, RNA editing, methylation and many other biological functions in both eukaryotic and prokaryotic proteomes. Predicting the functions of these proteins from primary amino acid sequences is becoming one of the major challenges in the functional annotation of genomes. Traditional prediction methods often focus on extracting physicochemical features from sequences but ignore motif information and the positional information between motifs. Meanwhile, small data volumes and heavy noise in the training data lower the accuracy and reliability of predictions. In this paper, we propose a deep learning based method to identify DNA-binding proteins from primary sequences alone. It uses two stages of convolutional neural networks to detect the functional domains of protein sequences, a long short-term memory neural network to identify their long-term dependencies, and binary cross-entropy to evaluate the quality of the neural networks. When the proposed method is tested on a realistic DNA-binding protein dataset, it achieves a prediction accuracy of 94.2% with a Matthews correlation coefficient of 0.961. Compared with LibSVM on the Arabidopsis and yeast datasets in independent tests, the accuracy rises by 9% and 4%, respectively. Comparative experiments using different feature extraction methods show that our model achieves accuracy similar to the best of the others, while its sensitivity, specificity and AUC increase by 27.83%, 1.31% and 16.21%, respectively. These results suggest that our method is a promising tool for identifying DNA-binding proteins.
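The evaluation quantities named in the abstract (binary cross-entropy, accuracy, sensitivity, specificity and the Matthews correlation coefficient) can be illustrated with a minimal sketch using their standard textbook definitions. The function names and the toy labels below are hypothetical and are not taken from the paper; the paper's own implementation and data are not shown here.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) is never evaluated.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and MCC for 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)

    # Standard MCC formula; defined as 0.0 here when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


if __name__ == "__main__":
    # Toy, made-up labels (1 = DNA-binding, 0 = non-binding); not data from the paper.
    labels = [1, 1, 1, 0, 0, 0]
    probabilities = [0.9, 0.8, 0.4, 0.2, 0.1, 0.7]
    predictions = [1 if p >= 0.5 else 0 for p in probabilities]
    print(binary_cross_entropy(labels, probabilities))
    print(binary_metrics(labels, predictions))
```

The 0.5 decision threshold in the usage lines is an assumption for illustration; sensitivity and specificity shift with the threshold, which is why a threshold-free measure such as AUC is usually reported alongside them.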