Department of Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, United Kingdom; Department of Computer Information Systems, University of Malta, Faculty of ICT, Msida, MSD 2080, Malta.
Department of Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, United Kingdom.
Biochim Biophys Acta Proteins Proteom. 2024 Feb 1;1872(2):140985. doi: 10.1016/j.bbapap.2023.140985. Epub 2023 Dec 19.
The number of unannotated proteins in UniProt grows rapidly every year as sequencing methods become more efficient. However, the experimental annotation of proteins is a lengthy and expensive process. Computational techniques can speed up this process by narrowing the search to highly specific Gene Ontology (GO) terms.
We propose an ensemble approach that combines three generic base predictors, each of which predicts Gene Ontology terms (Biological Process, BP; Cellular Component, CC; and Molecular Function, MF) from sequences across different species. We train our models on UniProtGOA annotation data and use the CATH domain resources to identify protein families. We then calculate a score based on the prevalence of individual GO terms in the functional families, and use that score as an indicator of confidence when assigning a GO term to an uncharacterised protein.
The ensemble has three components. The first is a statistics-based method that scores the occurrence of GO terms in a CATH functional family (FunFam) against a background set of proteins annotated with the same GO term. The second is a set-based method that we developed, which uses set intersection and set union to score the occurrence of GO terms within the same CATH FunFam. The third is FunFams-Plus, a predictor developed by the Orengo Group at UCL to predict GO terms for uncharacterised proteins in the CAFA3 challenge.
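The abstract does not give the formula behind the set-based score, so the following is only a minimal sketch of one way an intersection-over-union (Jaccard-style) score could be computed. The function name `jaccard_go_scores`, the input layout, and the choice of Jaccard overlap itself are assumptions made for illustration, not the published method.

```python
def jaccard_go_scores(funfam_members, go_annotations):
    """Score each GO term found in a FunFam by intersection over union.

    funfam_members: iterable of protein IDs in one FunFam (hypothetical input).
    go_annotations: dict mapping protein ID -> set of GO term IDs.
    Returns a dict mapping GO term -> score in (0, 1].
    """
    members = set(funfam_members)

    # Invert protein -> terms into term -> proteins carrying that term.
    term_to_proteins = {}
    for protein, terms in go_annotations.items():
        for term in terms:
            term_to_proteins.setdefault(term, set()).add(protein)

    scores = {}
    for term, annotated in term_to_proteins.items():
        shared = members & annotated      # FunFam members carrying the term
        if not shared:
            continue                      # term never occurs in this FunFam
        combined = members | annotated    # FunFam members plus all carriers
        scores[term] = len(shared) / len(combined)
    return scores


# Toy data: P4 carries GO:0005515 but is not a member of the FunFam.
funfam = ["P1", "P2", "P3"]
annotations = {
    "P1": {"GO:0003824"},
    "P2": {"GO:0003824", "GO:0005515"},
    "P3": {"GO:0005515"},
    "P4": {"GO:0005515"},
}
scores = jaccard_go_scores(funfam, annotations)
```

Under this assumed scoring, a term is rewarded when most FunFam members carry it and penalised when it is also common outside the family.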
We evaluated the methods against the CAFA3 benchmark and DomFun, using the CAFA3 benchmark datasets and the Precision, Recall and F-score metrics so that our models can be compared directly with the CAFA3 results. Our results show that FunPredCATH compares well with top CAFA methods across the different ontologies and benchmarks.
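For readers unfamiliar with the evaluation, the sketch below shows how a protein-centric maximum F-score of the kind used in CAFA can be computed: precision is averaged over proteins with at least one prediction at a given threshold, recall is averaged over all benchmark proteins, and the best F over thresholds is reported. The function name `fmax`, the threshold grid and the toy data are assumptions for illustration. This is not the official CAFA evaluation code, and it omits steps such as propagating terms up the ontology.

```python
def fmax(predictions, truth, thresholds=None):
    """Return (best F-score, threshold at which it is reached).

    predictions: dict protein -> dict GO term -> confidence score in [0, 1].
    truth: dict protein -> non-empty set of true GO terms (benchmark set).
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]

    best_f, best_t = 0.0, None
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, s in scored.items() if s >= t}
            hits = len(predicted & true_terms)
            if predicted:
                # Precision only counts proteins with a prediction at t.
                precisions.append(hits / len(predicted))
            # Recall counts every benchmark protein.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            f = 2 * precision * recall / (precision + recall)
            if f > best_f:
                best_f, best_t = f, t
    return best_f, best_t


# Toy benchmark with placeholder term labels.
truth = {"P1": {"A", "B"}, "P2": {"C"}}
predictions = {
    "P1": {"A": 0.9, "B": 0.4, "D": 0.3},
    "P2": {"C": 0.8, "A": 0.2},
}
best_f, best_t = fmax(predictions, truth)
```

Raising the threshold drops low-confidence terms, which typically raises precision and lowers recall; the maximum F-score picks the best trade-off between the two.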
FunPredCATH compares well with other prediction methods on CAFA3, and the ensemble approach outperforms the base methods. We show that models trained without IEA (Inferred from Electronic Annotation) annotations obtain higher F-scores than their IEA counterparts, while models that include IEA annotations achieve higher coverage at the expense of a lower F-score.