Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge.

Authors

Krallinger Martin, Morgan Alexander, Smith Larry, Leitner Florian, Tanabe Lorraine, Wilbur John, Hirschman Lynette, Valencia Alfonso

Affiliation

Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain.

Publication information

Genome Biol. 2008;9 Suppl 2(Suppl 2):S1. doi: 10.1186/gb-2008-9-s2-s1. Epub 2008 Sep 1.

Abstract

BACKGROUND

Genome sciences have experienced an increasing demand for efficient text-processing tools that can extract biologically relevant information from the growing amount of published literature. In response, a range of text-mining and information-extraction tools have recently been developed specifically for the biological domain. Such tools are only useful if they are designed to address real-life tasks and if their performance can be estimated and compared. The BioCreative challenge (Critical Assessment of Information Extraction in Biology) is a collaborative initiative to provide a common evaluation framework for monitoring and assessing the state of the art of text-mining systems applied to biologically relevant problems.

RESULTS

The Second BioCreative assessment (2006 to 2007) attracted 44 teams from 13 countries worldwide, with the aim of evaluating current information-extraction/text-mining technologies developed for one or more of the three tasks defined for this challenge evaluation. These tasks included the recognition of gene mentions in abstracts (gene mention task); the extraction of a list of unique identifiers for human genes mentioned in abstracts (gene normalization task); and finally the extraction of physical protein-protein interaction annotation-relevant information (protein-protein interaction task). The 'gold standard' data used for evaluating submissions for the third task was provided by the interaction databases MINT (Molecular Interaction Database) and IntAct.

CONCLUSION

The Second BioCreative assessment almost doubled the number of participants for each individual task when compared with the first BioCreative assessment. An overall improvement in terms of balanced precision and recall was observed for the best submissions for the gene mention task (F score 0.87); for the gene normalization task, the best results (F score 0.81) were comparable to those obtained for similar tasks posed at the first BioCreative challenge. In the case of the protein-protein interaction task, the importance and difficulties of extracting experimentally confirmed annotations from full-text articles were explored, yielding different results depending on the step of the annotation extraction workflow. A common characteristic observed in all three tasks was that the combination of system outputs could yield better results than any single system. Finally, the development of the first text-mining meta-server was promoted within the context of this community challenge.
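For readers unfamiliar with the metric, the F score reported above balances precision and recall. The sketch below computes the standard balanced F score (the harmonic mean of the two) for a set of predicted gene identifiers against a gold-standard set. The function names, the exact-match criterion, and the example identifiers are illustrative assumptions; the abstract does not describe the matching rules actually used by the BioCreative evaluation.

```python
def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F score: harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_identifiers(predicted: set, gold: set) -> float:
    """Score predictions against a gold standard using exact set matching.

    Exact matching is an assumption made for this sketch only.
    """
    tp = len(predicted & gold)   # identifiers found in both sets
    fp = len(predicted - gold)   # predicted but not in the gold standard
    fn = len(gold - predicted)   # in the gold standard but missed
    return f_score(tp, fp, fn)


# Hypothetical example data, not taken from the challenge.
gold = {"BRCA1", "TP53", "EGFR", "MYC"}
predicted = {"BRCA1", "TP53", "KRAS"}
score = score_identifiers(predicted, gold)
print(f"F score: {score:.2f}")
```

In this toy case precision is 2/3 and recall is 2/4, so the F score is 4/7, about 0.57.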

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d296/2559980/fba1667a6373/gb-2008-9-s2-s1-1.jpg
