Mining GO annotations for improving annotation consistency.

Affiliation

Department of Informatics, Faculty of Sciences, University of Lisbon, Lisbon, Portugal.

Publication information

PLoS One. 2012;7(7):e40519. doi: 10.1371/journal.pone.0040519. Epub 2012 Jul 25.

DOI:10.1371/journal.pone.0040519

PMID:22848383

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3405096/

Abstract

Despite the structure and objectivity provided by the Gene Ontology (GO), the annotation of proteins is a complex task that is subject to errors and inconsistencies. Electronically inferred annotations in particular are widely considered unreliable. However, given that manual curation of all GO annotations is unfeasible, it is imperative to improve the quality of electronically inferred annotations. In this work, we analyze the full GO molecular function annotation of UniProtKB proteins, and discuss some of the issues that affect their quality, focusing particularly on the lack of annotation consistency. Based on our analysis, we estimate that 64% of the UniProtKB proteins are incompletely annotated, and that inconsistent annotations affect 83% of the protein functions and at least 23% of the proteins. Additionally, we present and evaluate a data mining algorithm, based on the association rule learning methodology, for identifying implicit relationships between molecular function terms. The goal of this algorithm is to assist GO curators in updating GO and correcting and preventing inconsistent annotations. Our algorithm predicted 501 relationships with an estimated precision of 94%, whereas the basic association rule learning methodology predicted 12,352 relationships with a precision below 9%.
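
To make the baseline concrete, the following is a minimal sketch of basic association rule learning over protein annotation sets, not the refined algorithm evaluated in the paper. The protein identifiers, the placeholder term names, the function name `mine_rules`, and both thresholds are assumptions chosen for illustration. A rule "A implies B" is kept when enough proteins carry both terms (support) and when most proteins annotated with A are also annotated with B (confidence).

```python
from itertools import permutations


def mine_rules(annotations, min_support=2, min_confidence=0.8):
    """Return sorted (antecedent, consequent, support, confidence) tuples.

    `annotations` maps a protein ID to the set of terms annotated to it.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    term_count = {}  # number of proteins annotated with each term
    pair_count = {}  # number of proteins annotated with both terms of a pair
    for terms in annotations.values():
        for term in terms:
            term_count[term] = term_count.get(term, 0) + 1
        # Ordered pairs, because confidence(A -> B) != confidence(B -> A).
        for a, b in permutations(sorted(terms), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    rules = []
    for (a, b), support in pair_count.items():
        confidence = support / term_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical toy data: placeholder proteins and placeholder term names.
annotations = {
    "P1": {"term_A", "term_B"},
    "P2": {"term_A", "term_B"},
    "P3": {"term_A", "term_B", "term_C"},
    "P4": {"term_B"},
    "P5": {"term_C"},
}

rules = mine_rules(annotations)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={support}, confidence={confidence:.2f}")
```

In this toy data every protein annotated with `term_A` also carries `term_B`, so that rule passes both thresholds, while the reverse rule is rejected because one protein has `term_B` alone. Thresholds of this kind do not separate true implicit relationships from coincidental co-annotation, which is consistent with the abstract's report that the basic methodology reached a precision below 9%.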


