Bioengineering and Bioinformatics Research and Development Institute (IBB), FI-UNER, CONICET, Oro Verde 3100, Argentina.
Research Institute for Signals, Systems and Computational Intelligence (sinc(i)), FICH-UNL, CONICET, Ciudad Universitaria UNL, Santa Fe 3000, Argentina.
Bioinformatics. 2022 Sep 30;38(19):4488-4496. doi: 10.1093/bioinformatics/btac536.
Experimental testing and manual curation are the most precise ways of assigning Gene Ontology (GO) terms that describe protein functions. However, they are expensive and time-consuming, and they cannot keep pace with the exponential growth of data generated by high-throughput sequencing methods. Hence, researchers need reliable computational systems for automatic function prediction to help fill this gap. The results of the last Critical Assessment of Functional Annotation (CAFA) challenge revealed that GO term prediction remains a very challenging task. Recent developments in deep learning are significantly pushing the frontiers of protein research, yielding new knowledge thanks to the integration of data from multiple sources. However, the deep models developed so far for function prediction focus mainly on sequence data and have not yet achieved breakthrough performance.
We propose DeeProtGO, a novel deep-learning model for predicting GO annotations by integrating protein knowledge. DeeProtGO was trained to solve 18 different prediction problems, defined by the three GO sub-ontologies, the type of protein and the taxonomic kingdom. Our experiments showed higher prediction quality when more protein knowledge was integrated. We also benchmarked DeeProtGO against state-of-the-art methods on public datasets and showed that it can effectively improve the prediction of GO annotations.
DeeProtGO and a use case are available at https://github.com/gamerino/DeeProtGO.
Supplementary data are available at Bioinformatics online.