Andorf Carson, Dobbs Drena, Honavar Vasant
BMC Bioinformatics. 2007 Aug 3;8:284. doi: 10.1186/1471-2105-8-284.
Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors.
In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be inconsistent with the UniProt functions assigned to their human counterparts. In contrast, 97% of the predicted annotations generated using a machine learning approach were consistent with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database.
We therefore conjecture that most of our predicted annotations are correct, and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects. Editor's Note: Authors of the original publication (Okazaki et al.: Nature 2002, 420:563-73) have provided their response to Andorf et al., directly following the correspondence.
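The consistency check summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `flag_inconsistent`, the protein identifiers, and the functional-class labels are all hypothetical placeholders, and the comparison is simplified to exact label equality between a database annotation and an independent reference annotation. The only figures taken from the abstract are the counts 201 and 211.

```python
from typing import Dict, List


def flag_inconsistent(db_labels: Dict[str, str],
                      ref_labels: Dict[str, str]) -> List[str]:
    """Return the IDs whose database label differs from the reference label.

    Proteins without a reference label are skipped, since there is
    nothing independent to compare them against.
    """
    return sorted(
        pid for pid, label in db_labels.items()
        if pid in ref_labels and label != ref_labels[pid]
    )


# Toy placeholder data (hypothetical, not from the study).
db_labels = {"P1": "class_X", "P2": "class_Y", "P3": "class_X"}
ref_labels = {"P1": "class_Y", "P2": "class_Y", "P3": "class_Y"}

flagged = flag_inconsistent(db_labels, ref_labels)
print(flagged)

# Share of apparently inconsistent annotations reported in the abstract:
# 201 of 211 mouse protein kinases.
inconsistent_share = 201 / 211
print(f"{inconsistent_share:.1%}")
```

In practice the reference labels would come from an independent source, such as the UniProt annotations of human counterparts mentioned above, and a flagged entry is only a candidate error that still needs manual review.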