State Key Laboratory of Biotherapy and Cancer Center/Collaborative Innovation Center of Biotherapy, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China.
Department of Computational Medicine and Bioinformatics.
Bioinformatics. 2020 Aug 15;36(16):4383-4388. doi: 10.1093/bioinformatics/btaa548.
Many protein function databases are built on automated or semi-automated curation and can contain various annotation errors. Correcting such misannotations is critical to improving the accuracy and reliability of these databases.
We proposed a new approach to detect potentially incorrect Gene Ontology (GO) annotations by comparing the ratio of annotation rates (RAR) for the same GO term across different taxonomic groups; annotations with a relatively low RAR usually turn out to be incorrect. As an illustration, we applied the approach to 20 commonly studied species in two recent UniProt-GOA releases and identified 250 potential misannotations in the 2018-11-6 release, of which only 25% were corrected in the 2019-6-3 release. Importantly, 56% of the misannotations are 'Inferred from Biological aspect of Ancestor (IBA)' annotations, which contradicts previous observations that attributed misannotations mainly to 'Inferred from Sequence or structural Similarity (ISS)'. This probably reflects a shift in the source of errors caused by recent developments in function annotation databases. The results demonstrate a simple but efficient misannotation detection approach that is useful for large-scale comparative protein function studies.
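The abstract does not give the exact RAR formula, so the following is only a minimal sketch of the general idea under stated assumptions. It treats each species as its own taxonomic group, takes the annotation rate of a GO term to be the fraction of a species' proteins annotated with that term, and takes the RAR to be that rate divided by the highest rate observed for the same term in any group. The function names, the input layout, and the 0.1 cutoff are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def annotation_rates(annotations, proteome_sizes):
    """Compute per-term, per-species annotation rates.

    annotations: iterable of (species, protein_id, go_term) tuples
    proteome_sizes: dict mapping species -> total number of proteins
    Returns: dict go_term -> dict species -> fraction of proteins annotated
    """
    # Collect distinct proteins per (term, species) so repeated
    # annotations of the same protein are counted once.
    proteins = defaultdict(set)
    for species, protein_id, go_term in annotations:
        proteins[(go_term, species)].add(protein_id)

    rates = defaultdict(dict)
    for (go_term, species), ids in proteins.items():
        rates[go_term][species] = len(ids) / proteome_sizes[species]
    return rates


def low_rar_candidates(rates, threshold=0.1):
    """Flag (term, species, rar) triples whose RAR falls below threshold.

    Assumption: RAR = rate in this species / highest rate for the same
    term in any species. The threshold is an illustrative value only.
    """
    flagged = []
    for go_term, by_species in rates.items():
        top_rate = max(by_species.values())
        for species, rate in by_species.items():
            rar = rate / top_rate
            if rar < threshold:
                flagged.append((go_term, species, rar))
    # Lowest RAR first: these are the strongest candidates for review.
    return sorted(flagged, key=lambda item: item[2])
```

A term annotated on half of one species' proteins but on a single protein in another would receive a very low RAR in the second species and be flagged for manual review. Flagged entries are candidates only, since a low rate can also reflect genuine biology.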
https://zhanglab.ccmb.med.umich.edu/RAR.
Supplementary data are available at Bioinformatics online.