CrowdGO: Machine learning and semantic similarity guided consensus Gene Ontology annotation.

Affiliation

Department of Ecology and Evolution, University of Lausanne, and Swiss Institute of Bioinformatics, Lausanne, Switzerland.

Publication information

PLoS Comput Biol. 2022 May 13;18(5):e1010075. doi: 10.1371/journal.pcbi.1010075. eCollection 2022 May.

DOI:10.1371/journal.pcbi.1010075

PMID:35560159

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9132264/

Abstract

Characterising gene function for the ever-increasing number and diversity of species with annotated genomes relies almost entirely on computational prediction methods. These software tools are also numerous and diverse, each with different strengths and weaknesses as revealed through community benchmarking efforts. Meta-predictors that assess consensus and conflict from individual algorithms should deliver enhanced functional annotations. To exploit the benefits of meta-approaches, we developed CrowdGO, an open-source consensus-based Gene Ontology (GO) term meta-predictor that employs machine learning models with GO term semantic similarities and information contents. By re-evaluating each gene-term annotation, a consensus dataset is produced with high-scoring confident annotations and low-scoring rejected annotations. Applying CrowdGO to results from a deep learning-based, a sequence similarity-based, and two protein domain-based methods delivers consensus annotations with improved precision and recall. Furthermore, using standard evaluation measures, CrowdGO performance matches that of the community's best performing individual methods. CrowdGO therefore offers a model-informed approach to leverage strengths of individual predictors and produce comprehensive and accurate gene functional annotations.
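
The abstract names GO term information contents and semantic similarities as inputs to the machine learning models but does not say how they are computed. As a minimal sketch only, the Python below shows one common formulation on a toy ontology: information content derived from annotation frequency, and Lin similarity based on the most informative common ancestor. The term IDs, gene names, and function names are hypothetical, and the choice of Lin similarity is an assumption for illustration, not necessarily what CrowdGO uses.

```python
import math

# Toy ontology (hypothetical term IDs): term -> set of direct parents.
PARENTS = {
    "GO:root": set(),
    "GO:A": {"GO:root"},
    "GO:B": {"GO:root"},
    "GO:A1": {"GO:A"},
    "GO:A2": {"GO:A"},
}

# Hypothetical reference annotations: gene -> directly annotated terms.
ANNOTATIONS = {
    "gene1": {"GO:A1"},
    "gene2": {"GO:A2"},
    "gene3": {"GO:B"},
    "gene4": {"GO:A1"},
}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log(fraction of genes annotated to t or to a descendant of t).

    Terms with no annotated genes are left out of the returned dict.
    """
    counts = {term: 0 for term in PARENTS}
    for terms in ANNOTATIONS.values():
        # Propagate each direct annotation up to all ancestor terms.
        propagated = set()
        for term in terms:
            propagated |= ancestors(term)
        for term in propagated:
            counts[term] += 1
    total = len(ANNOTATIONS)
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


def lin_similarity(t1, t2, ic):
    """Lin similarity: 2 * IC(most informative common ancestor) / (IC(t1) + IC(t2))."""
    common = ancestors(t1) & ancestors(t2)
    mica_ic = max(ic[t] for t in common)
    denom = ic[t1] + ic[t2]
    # Two terms with zero information content (e.g. the root) are treated as identical.
    return 2 * mica_ic / denom if denom > 0 else 1.0


ic = information_content()
print(lin_similarity("GO:A1", "GO:A2", ic))  # siblings sharing the ancestor GO:A
print(lin_similarity("GO:A1", "GO:B", ic))   # terms sharing only the root
```

In a meta-predictor, values like these could serve as features describing how specific a predicted term is and how closely predictions from different methods agree for the same gene. How CrowdGO itself combines them is described in the full article, not in this abstract.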
