Annotation error in public databases: misannotation of molecular function in enzyme superfamilies.

Author information

Graduate Group in Biophysics, University of California San Francisco, San Francisco, California, United States of America.

Publication information

PLoS Comput Biol. 2009 Dec;5(12):e1000605. doi: 10.1371/journal.pcbi.1000605. Epub 2009 Dec 11.

Abstract

Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%-63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with "overprediction" of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.
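The misannotation levels quoted above are percentages per family and per database, averaged over the families in a superfamily. As a rough illustration only, the sketch below shows one way such a level could be tallied. The record layout `(database, family, is_misannotated)`, the function names, and the toy records are all assumptions made for this example; they are not the authors' pipeline or data.

```python
from collections import defaultdict


def misannotation_levels(records):
    """Return {(database, family): fraction of sequences flagged as misannotated}.

    Each record is a (database, family, is_misannotated) tuple; this layout
    is a hypothetical one chosen for illustration.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for database, family, is_misannotated in records:
        key = (database, family)
        totals[key] += 1
        if is_misannotated:
            wrong[key] += 1
    return {key: wrong[key] / totals[key] for key in totals}


def mean_level(levels, database, families):
    """Average the per-family levels of one database over the given families."""
    values = [levels[(database, f)] for f in families if (database, f) in levels]
    return sum(values) / len(values) if values else None


# Made-up toy records, NOT data from the study.
toy = [
    ("db_X", "family_A", True),
    ("db_X", "family_A", False),
    ("db_X", "family_B", False),
    ("db_X", "family_B", False),
]
levels = misannotation_levels(toy)
superfamily_mean = mean_level(levels, "db_X", ["family_A", "family_B"])
```

Averaging per-family fractions, rather than pooling all sequences, keeps a single large family from dominating the superfamily figure; whether the study weighted families this way is not stated in the abstract.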
