

An evaluation of GO annotation retrieval for BioCreAtIvE and GOA.

Authors

Camon Evelyn B, Barrell Daniel G, Dimmer Emily C, Lee Vivian, Magrane Michele, Maslen John, Binns David, Apweiler Rolf

Affiliation

European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

Publication

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S17. doi: 10.1186/1471-2105-6-S1-S17. Epub 2005 May 24.

Abstract

BACKGROUND

The Gene Ontology Annotation (GOA) database (http://www.ebi.ac.uk/GOA) aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as both the volume of literature and the number of proteins requiring characterization increase, manual processing capacity can become overloaded. Consequently, semi-automated aids are often employed to expedite the curation process. Traditionally, electronic techniques in GOA have depended largely on exploiting the knowledge in existing resources such as InterPro. In recent years, however, text mining has been hailed as a potentially useful tool to aid the curation process. To encourage the development of such tools, the GOA team at EBI agreed to take part in the functional annotation task of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge. BioCreAtIvE task 2 was an experiment to test whether automatically derived classification using information retrieval and extraction could assist expert biologists in annotating proteins in the UniProt Knowledgebase with terms from the GO vocabulary. GOA provided the training corpus of over 9000 manual GO annotations extracted from the literature. For the test set, we provided a corpus of 200 new Journal of Biological Chemistry articles used to annotate 286 human proteins with GO terms. A team of experts manually evaluated the results of 9 participating groups, each of which provided highlighted sentences to support their GO and protein annotation predictions. Here, we give a biological perspective on the evaluation, explain how we annotate GO using literature and offer some suggestions to improve the precision of future text-retrieval and extraction techniques. Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as an assessment of our current electronic GO annotation strategies.

RESULTS

The GOA database currently extracts GO annotation from the literature with 91 to 100% precision, and at least 72% recall. This sets a particularly high threshold for text mining systems, which in the initial results of BioCreAtIvE task 2 (GO annotation extraction and retrieval) predicted GO terms precisely only 10 to 20% of the time.
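
For readers less familiar with the two measures quoted above, the following is a minimal sketch of how precision and recall are conventionally computed. It assumes annotations are compared as exact (protein, GO term) pairs; the function name and the placeholder identifiers are hypothetical and are not taken from the GOA evaluation, whose own matching criteria are not described in this abstract.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) for two collections of (protein, term) pairs.

    Assumption: a prediction counts as correct only if the exact pair
    also appears in the reference (manually curated) set.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical placeholder data, for illustration only.
reference = {
    ("PROT_A", "TERM_1"),
    ("PROT_A", "TERM_2"),
    ("PROT_B", "TERM_3"),
    ("PROT_B", "TERM_4"),
}
predicted = {
    ("PROT_A", "TERM_1"),
    ("PROT_B", "TERM_3"),
    ("PROT_B", "TERM_5"),
}

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case two of the three predictions are correct and two of the four reference annotations are recovered, so precision is about 0.67 and recall is 0.50.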

CONCLUSION

Improvements in the performance and accuracy of text mining for GO terms should be expected in the next BioCreAtIvE challenge. In the meantime, the manual and electronic GO annotation strategies already employed by GOA will provide high-quality annotations.


