Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge.

Authors

Krallinger Martin, Morgan Alexander, Smith Larry, Leitner Florian, Tanabe Lorraine, Wilbur John, Hirschman Lynette, Valencia Alfonso

Affiliation

Structural Biology and BioComputing Programme, Spanish National Cancer Research Centre (CNIO), Madrid, Spain.

Publication details

Genome Biol. 2008;9 Suppl 2(Suppl 2):S1. doi: 10.1186/gb-2008-9-s2-s1. Epub 2008 Sep 1.

DOI:10.1186/gb-2008-9-s2-s1

PMID:18834487

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2559980/

Abstract

BACKGROUND

Genome sciences have experienced an increasing demand for efficient text-processing tools that can extract biologically relevant information from the growing amount of published literature. In response, a range of text-mining and information-extraction tools have recently been developed specifically for the biological domain. Such tools are only useful if they are designed to meet real-life tasks and if their performance can be estimated and compared. The BioCreative challenge (Critical Assessment of Information Extraction in Biology) consists of a collaborative initiative to provide a common evaluation framework for monitoring and assessing the state-of-the-art of text-mining systems applied to biologically relevant problems.

RESULTS

The Second BioCreative assessment (2006 to 2007) attracted 44 teams from 13 countries worldwide, with the aim of evaluating current information-extraction/text-mining technologies developed for one or more of the three tasks defined for this challenge evaluation. These tasks included the recognition of gene mentions in abstracts (gene mention task); the extraction of a list of unique identifiers for human genes mentioned in abstracts (gene normalization task); and finally the extraction of physical protein-protein interaction annotation-relevant information (protein-protein interaction task). The 'gold standard' data used for evaluating submissions for the third task was provided by the interaction databases MINT (Molecular Interaction Database) and IntAct.

CONCLUSION

The Second BioCreative assessment almost doubled the number of participants for each individual task when compared with the first BioCreative assessment. An overall improvement in terms of balanced precision and recall was observed for the best submissions for the gene mention (F score 0.87); for the gene normalization task, the best results were comparable (F score 0.81) compared with results obtained for similar tasks posed at the first BioCreative challenge. In case of the protein-protein interaction task, the importance and difficulties of experimentally confirmed annotation extraction from full-text articles were explored, yielding different results depending on the step of the annotation extraction workflow. A common characteristic observed in all three tasks was that the combination of system outputs could yield better results than any single system. Finally, the development of the first text-mining meta-server was promoted within the context of this community challenge.
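The F scores quoted above balance precision and recall, and the abstract notes that combining system outputs could outperform any single system. The following is a minimal sketch of both ideas, not the challenge's official evaluation code. It assumes the balanced F score is the harmonic mean of precision and recall, and it uses simple majority voting as one possible combination strategy (the abstract does not say how outputs were combined). The gene identifiers and system outputs are invented for illustration.

```python
from collections import Counter


def precision_recall_f1(predicted, gold):
    """Return (precision, recall, F1) for two collections of annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F score: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def majority_vote(system_outputs):
    """Keep annotations proposed by more than half of the systems."""
    counts = Counter(item for output in system_outputs for item in set(output))
    threshold = len(system_outputs) / 2
    return {item for item, n in counts.items() if n > threshold}


# Hypothetical gold standard and system outputs (not BioCreative data).
gold = {"G1", "G2", "G3", "G4"}
systems = [
    {"G1", "G2", "G5"},
    {"G1", "G3", "G6"},
    {"G2", "G3", "G4", "G7"},
]

scores = [precision_recall_f1(s, gold) for s in systems]
combined = majority_vote(systems)
combined_scores = precision_recall_f1(combined, gold)

for i, (p, r, f) in enumerate(scores, start=1):
    print(f"system {i}: P={p:.2f} R={r:.2f} F={f:.2f}")
p, r, f = combined_scores
print(f"combined: P={p:.2f} R={r:.2f} F={f:.2f}")
```

In this toy data the voted output scores higher than each individual system because the systems' false positives do not overlap. That is a property of the constructed example, not evidence about the actual submissions.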

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d296/2559980/fba1667a6373/gb-2008-9-s2-s1-1.jpg

Similar articles

Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge.

Genome Biol. 2008;9 Suppl 2(Suppl 2):S1. doi: 10.1186/gb-2008-9-s2-s1. Epub 2008 Sep 1.

Overview of the protein-protein interaction annotation extraction task of BioCreative II.

Genome Biol. 2008;9 Suppl 2(Suppl 2):S4. doi: 10.1186/gb-2008-9-s2-s4. Epub 2008 Sep 1.

Overview of the BioCreative III Workshop.

BMC Bioinformatics. 2011 Oct 3;12 Suppl 8(Suppl 8):S1. doi: 10.1186/1471-2105-12-S8-S1.

Overview of BioCreAtIvE: critical assessment of information extraction for biology.

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S1. doi: 10.1186/1471-2105-6-S1-S1. Epub 2005 May 24.

MINT and IntAct contribute to the Second BioCreative challenge: serving the text-mining community with high quality molecular interaction data.

Genome Biol. 2008;9 Suppl 2(Suppl 2):S5. doi: 10.1186/gb-2008-9-s2-s5. Epub 2008 Sep 1.

Evaluation of BioCreAtIvE assessment of task 2.

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S16. doi: 10.1186/1471-2105-6-S1-S16. Epub 2005 May 24.

BioCreative III interactive task: an overview.

BMC Bioinformatics. 2011 Oct 3;12 Suppl 8(Suppl 8):S4. doi: 10.1186/1471-2105-12-S8-S4.

Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine.

Database (Oxford). 2019 Jan 1;2019:bay147. doi: 10.1093/database/bay147.

An evaluation of GO annotation retrieval for BioCreAtIvE and GOA.

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S17. doi: 10.1186/1471-2105-6-S1-S17. Epub 2005 May 24.

BioCreAtIvE task 1A: gene mention finding evaluation.

BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S2. doi: 10.1186/1471-2105-6-S1-S2. Epub 2005 May 24.

Cited by

A machine learning driven automated system to extract multiple information fields from safety data sheet documents.

Heliyon. 2025 Jan 27;11(4):e42215. doi: 10.1016/j.heliyon.2025.e42215. eCollection 2025 Feb 28.

Challenges and opportunities for mining adverse drug reactions: perspectives from pharma, regulatory agencies, healthcare providers and consumers.

Database (Oxford). 2022 Sep 2;2022. doi: 10.1093/database/baac071.

Pre-trained models, data augmentation, and ensemble learning for biomedical information extraction and document classification.

Database (Oxford). 2022 Aug 13;2022. doi: 10.1093/database/baac066.

PHILM2Web: A high-throughput database of macromolecular host-pathogen interactions on the Web.

Database (Oxford). 2022 Jun 30;2022. doi: 10.1093/database/baac042.

Triage of documents containing protein interactions affected by mutations using an NLP based machine learning approach.

BMC Genomics. 2020 Nov 10;21(1):773. doi: 10.1186/s12864-020-07185-7.

Rich Text Formatted EHR Narratives: A Hidden and Ignored Trove.

Stud Health Technol Inform. 2019 Aug 21;264:472-476. doi: 10.3233/SHTI190266.

Database (Oxford). 2019 Jan 1;2019. doi: 10.1093/database/baz064.

Next generation community assessment of biomedical entity recognition web servers: metrics, performance, interoperability aspects of BeCalm.

J Cheminform. 2019 Jun 24;11(1):42. doi: 10.1186/s13321-019-0363-6.

Diagnosis of Breast Hyperplasia and Evaluation of RuXian-I Based on Metabolomics Deep Belief Networks.

Int J Mol Sci. 2019 May 28;20(11):2620. doi: 10.3390/ijms20112620.

Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine.

Database (Oxford). 2019 Jan 1;2019:bay147. doi: 10.1093/database/bay147.
