Wieser D, Kretschmann E, Apweiler R
Sequence Database Group, European Bioinformatics Institute, Cambridge, UK.
Bioinformatics. 2004 Aug 4;20 Suppl 1:i342-7. doi: 10.1093/bioinformatics/bth938.
Automatically generated annotation of the protein data in UniProt (Universal Protein Resource) is planned to be made publicly available on the UniProt web pages in April 2004. The data content of over 500,000 protein entries in the TrEMBL section is expected to be enhanced by the output of an automated annotation pipeline. However, part of the automatically added data will be erroneous, as is part of the information coming from other sources. We present a post-processing system called Xanthippe that is based on a simple exclusion mechanism and a decision tree approach using the C4.5 data-mining algorithm.
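The exclusion mechanism is not specified in detail here, so the following is only a minimal sketch of one plausible reading: an annotation that occurs in a trusted reference set, but never together with a given condition (for example a taxonomic group), is treated as excluded for that condition and is flagged when an automated pipeline proposes it. The function names, the `min_support` threshold and the toy data are all assumptions made for illustration, not part of Xanthippe, and the decision tree (C4.5) component is not covered.

```python
from collections import defaultdict


def learn_exclusion_rules(reference, min_support=2):
    """Derive (condition, annotation) pairs never seen together in `reference`.

    `reference` is a list of (condition, set_of_annotations) tuples taken from
    a trusted, curated set. A condition only yields rules if it occurs at
    least `min_support` times (hypothetical threshold).
    """
    condition_counts = defaultdict(int)
    seen_with = defaultdict(set)
    all_annotations = set()
    for condition, annotations in reference:
        condition_counts[condition] += 1
        seen_with[condition] |= annotations
        all_annotations |= annotations

    rules = set()
    for condition, count in condition_counts.items():
        if count < min_support:
            continue  # too little evidence to exclude anything
        for annotation in all_annotations - seen_with[condition]:
            rules.add((condition, annotation))
    return rules


def flag_annotations(condition, predicted, rules):
    """Return the predicted annotations that an exclusion rule contradicts."""
    return {a for a in predicted if (condition, a) in rules}


# Invented toy reference data, for illustration only.
reference = [
    ("Bacteria", {"Plasmid"}),
    ("Bacteria", {"Sporulation"}),
    ("Eukaryota", {"Nuclear protein"}),
    ("Eukaryota", {"Mitochondrion"}),
]
rules = learn_exclusion_rules(reference)
flagged = flag_annotations("Bacteria", {"Nuclear protein", "Plasmid"}, rules)
print(flagged)
```

In this sketch flagged annotations are returned rather than deleted, which matches the description of a system that detects and flags errors.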
Xanthippe detects and flags a large part of the annotation errors and considerably increases the reliability of both automatically generated data and annotation from other sources. Cross-validation against Swiss-Prot shows that errors in protein descriptions, comments and keywords are successfully filtered out. Xanthippe is a contradictive application, one that flags existing annotation as wrong rather than proposing new annotation, and it can be combined seamlessly with predictive systems. It can be used either to improve the precision of automated annotation at a constant level of recall or to increase the recall at a constant level of precision.
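The precision and recall trade-off can be illustrated with a small worked example. This is a sketch under stated assumptions: annotations are modelled as (entry, annotation) pairs, the curated set plays the role of a reference such as Swiss-Prot, and all identifiers and data are invented. Removing a flagged pair that is indeed wrong shrinks the set of predictions without losing any correct one, so precision rises while recall stays the same.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for sets of (entry, annotation) pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Invented toy data: four predicted pairs, three curated pairs.
predicted = {("e1", "a"), ("e1", "b"), ("e2", "c"), ("e3", "d")}
truth = {("e1", "a"), ("e2", "c"), ("e3", "e")}

# Pairs that a contradictive filter is assumed to have flagged as errors.
flagged = {("e1", "b")}

before = precision_recall(predicted, truth)
after = precision_recall(predicted - flagged, truth)
print("before filtering:", before)
print("after filtering: ", after)
```

The second mode described above follows from the same arithmetic: a predictive system can be run with a more permissive threshold to gain recall, and the filter is then relied on to remove enough of the additional errors to hold precision at its earlier level.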
The application of the Xanthippe rules can be browsed at http://www.ebi.uniprot.org/