Department of Computer Engineering, METU, Ankara, 06800, Turkey.
Department of Computer Engineering, İskenderun Technical University, Hatay, 31200, Turkey.
Sci Rep. 2019 May 14;9(1):7344. doi: 10.1038/s41598-019-43708-3.
Automated protein function prediction is critical for the annotation of uncharacterized protein sequences, where accurate prediction methods are still required. Recently, deep learning based methods have outperformed conventional algorithms in computer vision and natural language processing, owing to their ability to prevent overfitting and to train efficiently. Here, we propose DEEPred, a hierarchical stack of multi-task feed-forward deep neural networks, as a solution to Gene Ontology (GO) based protein function prediction. DEEPred was optimized through rigorous hyper-parameter tests and benchmarked using three types of protein descriptors, training datasets of varying sizes, and GO terms from different levels. Furthermore, to explore how training with larger but potentially noisy data would change the performance, electronically made GO annotations were also included in the training process. The overall predictive performance of DEEPred was assessed on the CAFA2 and CAFA3 challenge datasets, in comparison with state-of-the-art protein function prediction methods. Finally, we evaluated selected novel annotations produced by DEEPred with a literature-based case study on the 'biofilm formation process' in Pseudomonas aeruginosa. This study indicates that deep learning algorithms have significant potential in protein function prediction, particularly when the source data is large. The neural network architecture of DEEPred can also be applied to the prediction of other types of ontological associations. The source code and all datasets used in this study are available at: https://github.com/cansyl/DEEPred
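To make the phrase "hierarchical stack of multi-task feed-forward deep neural networks" more concrete, the following is a minimal sketch of the general idea, not the DEEPred implementation. It assumes that each model in the stack shares its hidden layers across a group of GO terms and emits one sigmoid score per term, and that predictions from all models are pooled with a score threshold. The class name `MultiTaskFFN`, the helper `predict_stack`, the ReLU activation, the layer sizes, the random weights, and the placeholder term labels are all illustrative assumptions; the actual architecture and training code are in the repository linked above.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class MultiTaskFFN:
    """Hypothetical multi-task network: shared hidden layers, one sigmoid output per GO term."""

    def __init__(self, n_features, hidden_sizes, go_terms, rng):
        self.go_terms = list(go_terms)
        sizes = [n_features, *hidden_sizes, len(self.go_terms)]
        # Random, untrained weights: this sketch only shows the forward pass.
        self.weights = [
            rng.normal(0.0, 0.1, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])
        ]
        self.biases = [np.zeros(b) for b in sizes[1:]]

    def predict_proba(self, x):
        h = x
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            h = np.maximum(0.0, h @ w + b)  # ReLU is an assumption here
        # One independent sigmoid score per GO term (multi-task output layer).
        return sigmoid(h @ self.weights[-1] + self.biases[-1])


def predict_stack(models, x, threshold=0.5):
    """Run every model in the stack and keep GO terms scoring at or above the threshold."""
    hits = {}
    for model in models:
        for term, score in zip(model.go_terms, model.predict_proba(x)):
            if score >= threshold:
                hits[term] = float(score)
    return hits


rng = np.random.default_rng(0)
# Two models standing in for two groups of GO terms (placeholder labels, not real GO IDs).
stack = [
    MultiTaskFFN(8, [16], ["term_A1", "term_A2"], rng),
    MultiTaskFFN(8, [16, 16], ["term_B1", "term_B2", "term_B3"], rng),
]
x = rng.normal(size=8)  # stand-in for a protein descriptor vector
predictions = predict_stack(stack, x, threshold=0.0)
```

With a threshold of 0.0 every term is returned, which is convenient for inspecting the raw scores; in practice a per-term or global cutoff would be chosen on validation data.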