Kretschmann E, Fleischmann W, Apweiler R
The EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
Bioinformatics. 2001 Oct;17(10):920-6. doi: 10.1093/bioinformatics/17.10.920.
The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools, without the use of automated annotation systems, cannot keep up with the ever-increasing quantity of submitted data. Automated supplements to manually curated databases, such as TrEMBL or GenPept, cover the raw data but provide only limited annotation. To improve this situation, automatic tools are needed that support manual annotation, automatically increase the amount of reliable information, and help to detect inconsistencies in manually generated annotations.
A standard data mining algorithm was successfully applied to gain knowledge about the keyword annotation in SWISS-PROT. In total, 11 306 rules were generated; they are provided in a database, can be applied to as yet unannotated protein sequences, and can be viewed using a web browser. The rules rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. Statistical evaluation of the generated rules by cross-validation suggests that applying them to arbitrary proteins can generate 33% of their keyword annotation with an error rate of 1.5%. Coverage of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%.
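As a rough illustration of how rules of this kind might be applied, the sketch below matches a protein against rules conditioned on taxonomy and signature matches, and filters them by a confidence threshold. It is not the authors' implementation: the `Rule` and `Protein` types, the per-rule `confidence` field, the placeholder names (`SIG_A`, `Keyword-1`, and so on), and the threshold values are all assumptions made for the example. The abstract reports aggregate error rates only, so treating confidence as a per-rule attribute is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: if the conditions hold, predict `keyword`."""
    taxon: str             # taxonomic group the source organism must belong to
    signatures: frozenset  # signature identifiers that must all match the sequence
    keyword: str           # keyword annotation proposed by the rule
    confidence: float      # assumed estimate of the rule's precision (0..1)


@dataclass(frozen=True)
class Protein:
    lineage: tuple         # taxonomic lineage of the source organism
    signatures: frozenset  # signature identifiers matched by the sequence


def predict_keywords(protein, rules, min_confidence):
    """Return keywords from all rules that fire and meet the confidence threshold."""
    lineage = set(protein.lineage)
    return {
        rule.keyword
        for rule in rules
        if rule.confidence >= min_confidence
        and rule.taxon in lineage
        and rule.signatures <= protein.signatures  # all required signatures present
    }


# Placeholder data, invented purely for illustration.
rules = [
    Rule("Eukaryota", frozenset({"SIG_A"}), "Keyword-1", 0.99),
    Rule("Metazoa", frozenset({"SIG_A", "SIG_B"}), "Keyword-2", 0.95),
    Rule("Bacteria", frozenset({"SIG_A"}), "Keyword-3", 0.99),
]
protein = Protein(
    lineage=("Eukaryota", "Metazoa"),
    signatures=frozenset({"SIG_A", "SIG_B"}),
)

# A strict threshold keeps only the most reliable rules (lower coverage);
# a relaxed threshold admits more rules (higher coverage, more errors).
strict = predict_keywords(protein, rules, min_confidence=0.985)
relaxed = predict_keywords(protein, rules, min_confidence=0.95)

if __name__ == "__main__":
    print(sorted(strict))
    print(sorted(relaxed))
```

Lowering `min_confidence` lets more rules fire, which mirrors the trade-off described above: more keyword annotation is generated, at the cost of a higher expected error rate.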
The results of the automatic data mining process can be browsed at http://golgi.ebi.ac.uk:8080/Spearmint/. Source code is available upon request.