Annotation error in public databases: misannotation of molecular function in enzyme superfamilies.

Affiliation

Graduate Group in Biophysics, University of California San Francisco, San Francisco, California, United States of America.

Publication information

PLoS Comput Biol. 2009 Dec;5(12):e1000605. doi: 10.1371/journal.pcbi.1000605. Epub 2009 Dec 11.

DOI: 10.1371/journal.pcbi.1000605

PMID: 20011109

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2781113/

Abstract

Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%-63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with "overprediction" of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.

Similar articles

CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L.) methylation filtered genomic genespace sequences.

BMC Bioinformatics. 2007 Apr 19;8:129. doi: 10.1186/1471-2105-8-129.

The UniProtKB/Swiss-Prot knowledgebase and its Plant Proteome Annotation Program.

J Proteomics. 2009 Apr 13;72(3):567-73. doi: 10.1016/j.jprot.2008.11.010. Epub 2008 Nov 24.

Protein sequence annotation in the genome era: the annotation concept of SWISS-PROT+TREMBL.

Proc Int Conf Intell Syst Mol Biol. 1997;5:33-43.

UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View.

Methods Mol Biol. 2016;1374:23-54. doi: 10.1007/978-1-4939-3167-5_2.

UniProtKB/Swiss-Prot.

Methods Mol Biol. 2007;406:89-112. doi: 10.1007/978-1-59745-535-0_4.

UniSave: the UniProtKB sequence/annotation version database.

Bioinformatics. 2006 May 15;22(10):1284-5. doi: 10.1093/bioinformatics/btl105. Epub 2006 Mar 21.

Database verification studies of SWISS-PROT and GenBank.

Bioinformatics. 2001 Jun;17(6):526-32; discussion 533-4. doi: 10.1093/bioinformatics/17.6.526.

Experimental and computational investigation of enzyme functional annotations uncovers misannotation in the EC 1.1.3.15 enzyme class.

PLoS Comput Biol. 2021 Sep 23;17(9):e1009446. doi: 10.1371/journal.pcbi.1009446. eCollection 2021 Sep.

HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot.

Nucleic Acids Res. 2009 Jan;37(Database issue):D471-8. doi: 10.1093/nar/gkn661. Epub 2008 Oct 11.

Cited by

Fold first, ask later: structure-informed function annotation of phage proteins.

bioRxiv. 2025 Jul 20:2025.07.17.665397. doi: 10.1101/2025.07.17.665397.

Stress testing reveals selective vulnerabilities in protein homeostasis.

bioRxiv. 2025 Jun 16:2025.06.11.659168. doi: 10.1101/2025.06.11.659168.

Refinement of the Reference Viral Database (RVDB) for improving bioinformatics analysis of virus detection by high-throughput sequencing (HTS).

mSphere. 2025 Jul 29;10(7):e0028625. doi: 10.1128/msphere.00286-25. Epub 2025 Jun 23.

Sequence and taxonomic feature evaluation facilitated the discovery of alcohol oxidases.

Synth Syst Biotechnol. 2025 Apr 22;10(3):907-915. doi: 10.1016/j.synbio.2025.04.014. eCollection 2025 Sep.

A longitudinal analysis of function annotations of the human proteome reveals consistently high biases.

Database (Oxford). 2025 May 7;2025. doi: 10.1093/database/baaf036.

Intramolecular epistasis correlates with divergence of specificity in promiscuous and bifunctional NSAR/OSBS enzymes.

Protein Sci. 2025 May;34(5):e70113. doi: 10.1002/pro.70113.

Biological databases in the age of generative artificial intelligence.

Bioinform Adv. 2025 Mar 20;5(1):vbaf044. doi: 10.1093/bioadv/vbaf044. eCollection 2025.

Functional Annotation and Structural Characterization of Hypothetical Proteins in and Isolated from Honey.

ACS Omega. 2025 Feb 27;10(9):8993-9006. doi: 10.1021/acsomega.4c07105. eCollection 2025 Mar 11.

High-throughput protein characterization by complementation using DNA barcoded fragment libraries.

Mol Syst Biol. 2024 Nov;20(11):1207-1229. doi: 10.1038/s44320-024-00068-z. Epub 2024 Oct 7.

Interactive tools for functional annotation of bacterial genomes.

Database (Oxford). 2024 Sep 6;2024. doi: 10.1093/database/baae089.
