Filtering erroneous protein annotation.

Authors

Wieser D, Kretschmann E, Apweiler R

Affiliation

Sequence Database Group, European Bioinformatics Institute, Cambridge, UK.

Publication information

Bioinformatics. 2004 Aug 4;20 Suppl 1:i342-7. doi: 10.1093/bioinformatics/bth938.

DOI:10.1093/bioinformatics/bth938

PMID:15262818

Abstract

MOTIVATION

Automatically generated annotation on protein data of UniProt (Universal Protein Resource) is planned to be publicly available on the UniProt web pages in April 2004. It is expected that the data content of over 500,000 protein entries in the TrEMBL section will be enhanced by the output of an automated annotation pipeline. However, a part of the automatically added data will be erroneous, as are parts of the information coming from other sources. We present a post-processing system called Xanthippe that is based on a simple exclusion mechanism and a decision tree approach using the C4.5 data-mining algorithm.

RESULTS

It is shown that Xanthippe detects and flags a large part of the annotation errors and considerably increases the reliability of both automatically generated data and annotation from other sources. As a cross-validation to Swiss-Prot shows, errors in protein descriptions, comments and keywords are successfully filtered out. Xanthippe is a contradictive application that can be combined seamlessly with predictive systems. It can be used either to improve the precision of automated annotation at a constant level of recall or increase the recall at a constant level of precision.

AVAILABILITY

The application of the Xanthippe rules can be browsed at http://www.ebi.uniprot.org/
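
The abstract describes a filter that removes contradicted annotations from a predictive system's output, improving precision while leaving recall unchanged. The following is a minimal sketch of that idea, not the Xanthippe implementation: the rule table, the `Prediction` type, the entry identifiers, and the toy data are all hypothetical, and the hand-written exclusion pairs merely stand in for rules that the paper derives with an exclusion mechanism and C4.5 decision trees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    entry: str    # hypothetical protein entry identifier
    taxon: str    # hypothetical coarse taxonomic group of the entry
    keyword: str  # annotation proposed by some predictive system


# Hypothetical exclusion rules: (taxon, keyword) pairs treated as contradictory.
# These are illustrative stand-ins, not rules taken from the paper.
EXCLUSION_RULES = {
    ("Bacteria", "Nuclear protein"),
    ("Archaea", "Chloroplast"),
}


def is_flagged(p: Prediction) -> bool:
    """Return True if an exclusion rule contradicts the predicted annotation."""
    return (p.taxon, p.keyword) in EXCLUSION_RULES


def precision_recall(predicted, truth):
    """Compute precision and recall of predicted (entry, keyword) pairs."""
    predicted_pairs = {(p.entry, p.keyword) for p in predicted}
    true_positives = len(predicted_pairs & truth)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy output of a predictive system (assumed data).
predictions = [
    Prediction("E1", "Bacteria", "Kinase"),
    Prediction("E1", "Bacteria", "Nuclear protein"),
    Prediction("E2", "Eukaryota", "Nuclear protein"),
    Prediction("E3", "Archaea", "Chloroplast"),
]

# Toy reference annotation, playing the role of a curated set (assumed data).
truth = {("E1", "Kinase"), ("E2", "Nuclear protein"), ("E3", "Transferase")}

# Contradictive step: drop every prediction that an exclusion rule flags.
kept = [p for p in predictions if not is_flagged(p)]

before = precision_recall(predictions, truth)
after = precision_recall(kept, truth)
print("before filtering:", before)
print("after filtering: ", after)
```

In this toy data the two flagged predictions are both wrong, so filtering raises precision without touching recall. A rule that flagged a correct annotation would lower recall instead, which is why the quality of the exclusion rules matters.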

Similar articles

Mining sequence annotation databanks for association patterns.

Bioinformatics. 2005 Nov 1;21 Suppl 3:iii49-57. doi: 10.1093/bioinformatics/bti1206.

MineBlast: a literature presentation service supporting protein annotation by data mining of BLAST results.

Bioinformatics. 2005 Aug 15;21(16):3450-1. doi: 10.1093/bioinformatics/bti528. Epub 2005 Jun 7.

SSMap: a new UniProt-PDB mapping resource for the curation of structural-related information in the UniProt/Swiss-Prot Knowledgebase.

BMC Bioinformatics. 2008 Sep 23;9:391. doi: 10.1186/1471-2105-9-391.

ProRule: a new database containing functional and structural information on PROSITE profiles.

Bioinformatics. 2005 Nov 1;21(21):4060-6. doi: 10.1093/bioinformatics/bti614. Epub 2005 Aug 9.

UniSave: the UniProtKB sequence/annotation version database.

Bioinformatics. 2006 May 15;22(10):1284-5. doi: 10.1093/bioinformatics/btl105. Epub 2006 Mar 21.

Annotating proteins by mining protein interaction networks.

Bioinformatics. 2006 Jul 15;22(14):e260-70. doi: 10.1093/bioinformatics/btl221.

An efficient, versatile and scalable pattern growth approach to mine frequent patterns in unaligned protein sequences.

Bioinformatics. 2007 Mar 15;23(6):687-93. doi: 10.1093/bioinformatics/btl665. Epub 2007 Jan 19.

Domain-based small molecule binding site annotation.

BMC Bioinformatics. 2006 Mar 17;7:152. doi: 10.1186/1471-2105-7-152.

Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences.

BMC Bioinformatics. 2006 Jun 15;7:304. doi: 10.1186/1471-2105-7-304.

Cited by

Path to improving the life cycle and quality of genome-scale models of metabolism.

Cell Syst. 2021 Sep 22;12(9):842-859. doi: 10.1016/j.cels.2021.06.005.

Literature consistency of bioinformatics sequence databases is effective for assessing record quality.

Database (Oxford). 2017 Jan 1;2017(1). doi: 10.1093/database/bax021.

Automatic policing of biochemical annotations using genomic correlations.

Nat Chem Biol. 2010 Jan;6(1):34-40. doi: 10.1038/nchembio.266. Epub 2009 Nov 22.

The Universal Protein Resource (UniProt) in 2010.

Nucleic Acids Res. 2010 Jan;38(Database issue):D142-8. doi: 10.1093/nar/gkp846. Epub 2009 Oct 20.

Genome and proteome annotation: organization, interpretation and integration.

J R Soc Interface. 2009 Feb 6;6(31):129-47. doi: 10.1098/rsif.2008.0341.

The Universal Protein Resource (UniProt) 2009.

Nucleic Acids Res. 2009 Jan;37(Database issue):D169-74. doi: 10.1093/nar/gkn664. Epub 2008 Oct 4.

In silico characterization of proteins: UniProt, InterPro and Integr8.

Mol Biotechnol. 2008 Feb;38(2):165-77. doi: 10.1007/s12033-007-9003-x. Epub 2007 Oct 4.

The universal protein resource (UniProt).

Nucleic Acids Res. 2008 Jan;36(Database issue):D190-5. doi: 10.1093/nar/gkm895. Epub 2007 Nov 27.

The Universal Protein Resource (UniProt).

Nucleic Acids Res. 2007 Jan;35(Database issue):D193-7. doi: 10.1093/nar/gkl929. Epub 2006 Nov 16.

Probabilistic annotation of protein sequences based on functional classifications.

BMC Bioinformatics. 2005 Dec 14;6:302. doi: 10.1186/1471-2105-6-302.
