Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach.

Authors

Carson Andorf, Drena Dobbs, Vasant Honavar

Publication information

BMC Bioinformatics. 2007 Aug 3;8:284. doi: 10.1186/1471-2105-8-284.

DOI:10.1186/1471-2105-8-284

PMID:17683567

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1994202/

Abstract

BACKGROUND

Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors.
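The idea of using predictions as an independent check can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `flag_disagreements`, the protein identifiers, and the class labels are all hypothetical, and the predicted labels are assumed to come from some already-trained classifier.

```python
def flag_disagreements(db_annotations, predicted_annotations):
    """Return protein IDs whose database label differs from the predicted label.

    Both arguments are dicts mapping a protein ID to a single functional
    class label (a simplifying assumption; real GO annotations can be
    multi-label and hierarchical).
    """
    flagged = []
    for protein_id, db_label in db_annotations.items():
        predicted = predicted_annotations.get(protein_id)
        # Proteins without a prediction are skipped: a missing prediction
        # is not evidence of an annotation error.
        if predicted is not None and predicted != db_label:
            flagged.append(protein_id)
    return sorted(flagged)


# Hypothetical toy data, invented purely for illustration.
db = {"P1": "class_A", "P2": "class_B", "P3": "class_A"}
pred = {"P1": "class_B", "P2": "class_B"}

flagged = flag_disagreements(db, pred)
print(flagged)
```

A flagged protein is only a candidate for review. A disagreement may reflect a prediction error rather than an annotation error, which is why the check is run against independent sources of evidence.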

RESULTS

In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be inconsistent with the UniProt functions assigned to their human counterparts. In contrast, 97% of the predicted annotations generated using a machine learning approach were consistent with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database.
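To put the reported counts on a common scale, the short Python sketch below converts them to percentages. It assumes that each of the 211 kinases contributes exactly one AmiGO annotation, so that 211 is the right denominator for the 201 inconsistent annotations. The abstract does not state this explicitly, and the variable and function names are illustrative.

```python
def percentage(part, total):
    """Return part/total as a percentage, rejecting an empty denominator."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


total_kinases = 211        # mouse protein kinases examined (from the abstract)
inconsistent_amigo = 201   # AmiGO annotations inconsistent with UniProt

# Assumption: one annotation per kinase, so the remainder is consistent.
consistent_amigo = total_kinases - inconsistent_amigo
inconsistent_pct = percentage(inconsistent_amigo, total_kinases)
consistent_pct = percentage(consistent_amigo, total_kinases)

print(f"AmiGO inconsistent: {inconsistent_pct:.1f}%")
print(f"AmiGO consistent:   {consistent_pct:.1f}%")
```

Under that assumption, roughly 95% of the existing annotations disagree with the human counterparts, compared with the 97% agreement reported for the predicted annotations.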

CONCLUSION

We conjecture that most of our predicted annotations are, therefore, correct and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects. Editor's Note: Authors from the original publication (Okazaki et al.: Nature 2002, 420:563-73) have provided their response to Andorf et al., directly following the correspondence.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4336/1994202/84fbdc20b036/1471-2105-8-284-1.jpg
