Evaluation of BioCreAtIvE assessment of task 2.

Authors

Blaschke Christian, Leon Eduardo Andres, Krallinger Martin, Valencia Alfonso

Affiliation

Bioalma SL, Ronda de Poniente 4- 2nd floor, Tres Cantos, E-28760, Madrid, Spain.

Publication information

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S16. doi: 10.1186/1471-2105-6-S1-S16. Epub 2005 May 24.

Abstract

BACKGROUND

Molecular biology has accumulated substantial amounts of data on the functions of genes and proteins. Functional descriptions are generally extracted manually from text and stored in biological databases to build up annotations for large collections of gene products. These annotation databases are crucial for interpreting large-scale analyses based on bioinformatics or experimental techniques. Because functional descriptions keep accumulating in the biomedical literature, there is an urgent need for text mining tools that facilitate the extraction of such annotations. To make text mining tools usable in real-world scenarios, for instance to assist database curators during the annotation of protein function, different approaches need to be compared and evaluated on full-text articles.

RESULTS

The Critical Assessment for Information Extraction in Biology (BioCreAtIvE) contest is a community-wide competition that aims to evaluate different text mining strategies as applied to the biomedical literature. We report on task 2, which addressed the automatic extraction and assignment of Gene Ontology (GO) annotations of human proteins from full-text articles. Predictions for task 2 take the form of triplets of protein, GO term, and article passage. Participants returned the annotation-relevant text passages, which were evaluated by expert curators of the GO annotation (GOA) team at the European Bioinformatics Institute (EBI). Each participant could submit up to three results for each sub-task of task 2. In total, participants provided more than 15,000 individual results. In addition to the annotation itself, the curators evaluated whether the protein and the GO term were correctly predicted and traceable through the submitted text fragment.
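The triplet format and the per-element curator check described above can be pictured with a minimal sketch. The class names, field names, and the rule that a triplet counts as correct only when both the protein and the GO term are judged supported are illustrative assumptions, not the scoring scheme used by the GOA curators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    """One submitted result (hypothetical layout, not the official format)."""
    protein: str   # protein identifier as submitted
    go_term: str   # GO identifier, e.g. "GO:0005515"
    passage: str   # supporting text fragment from the full-text article


@dataclass(frozen=True)
class Judgement:
    """Curator verdict on whether the passage supports each element."""
    protein_ok: bool
    go_term_ok: bool


def is_correct(judgement: Judgement) -> bool:
    # Assumption: both elements must be supported for the triplet to count.
    return judgement.protein_ok and judgement.go_term_ok


def fraction_correct(judgements: list[Judgement]) -> float:
    """Share of evaluated triplets judged fully correct (0.0 if none)."""
    if not judgements:
        return 0.0
    return sum(is_correct(j) for j in judgements) / len(judgements)
```

With two invented judgements, one fully supported and one with an unsupported GO term, `fraction_correct([Judgement(True, True), Judgement(True, False)])` evaluates to 0.5. Keeping the protein and GO term verdicts separate mirrors the abstract's point that both were assessed individually.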

CONCLUSION

The concepts provided by GO are currently the most widespread set of terms used for annotating gene products, so they were used to assess how effectively text mining tools can extract those annotations automatically. Although the results are promising, they still fall far short of the performance required by real-world applications. The principal difficulties in addressing the task were the complex nature of GO terms and protein names (the large range of variants used to express proteins, and especially GO terms, in free text) and the lack of a standard training set. Participants used a range of very different strategies to tackle the task. The dataset generated in the BioCreAtIvE challenge is publicly available and opens new possibilities for training information extraction methods in the domain of molecular biology.


