Department of Informatics, Faculty of Sciences, University of Lisbon, Lisbon, Portugal.
PLoS One. 2012;7(7):e40519. doi: 10.1371/journal.pone.0040519. Epub 2012 Jul 25.
Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
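For orientation, the basic association rule learning methodology that the abstract uses as a baseline can be sketched as follows. This is a minimal illustration, not the authors' algorithm. The function name `mine_rules`, the thresholds `min_support` and `min_confidence`, and the toy protein identifiers and annotations are all assumptions introduced here. A rule "A implies B" is kept when enough proteins carry both terms (support) and a large enough share of the proteins annotated with A are also annotated with B (confidence).

```python
from collections import Counter
from itertools import permutations


def mine_rules(annotations, min_support=2, min_confidence=0.8):
    """Return (a, b, support, confidence) tuples for rules 'a implies b'.

    annotations maps a protein identifier to the set of molecular
    function terms assigned to it. Thresholds are illustrative only.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in annotations.values():
        terms = set(terms)
        term_counts.update(terms)
        # Count every ordered pair of distinct co-annotated terms.
        pair_counts.update(permutations(sorted(terms), 2))

    rules = []
    for (a, b), n_ab in pair_counts.items():
        confidence = n_ab / term_counts[a]
        if n_ab >= min_support and confidence >= min_confidence:
            rules.append((a, b, n_ab, confidence))
    return sorted(rules)


# Toy data (hypothetical proteins, not taken from UniProtKB).
toy_annotations = {
    "P1": {"kinase activity", "ATP binding"},
    "P2": {"kinase activity", "ATP binding"},
    "P3": {"kinase activity", "ATP binding"},
    "P4": {"ATP binding"},
    "P5": {"ATP binding", "helicase activity"},
}

for a, b, support, confidence in mine_rules(toy_annotations):
    print(f"{a} -> {b} (support={support}, confidence={confidence:.2f})")
```

On this toy input only the rule from "kinase activity" to "ATP binding" passes both thresholds. The reverse rule fails on confidence (3 of 5 proteins) and the "helicase activity" rule fails on support. An unfiltered approach of this kind keeps every co-occurrence pattern that clears the thresholds, which is consistent with the low precision the abstract reports for the basic methodology.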