HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools.

Author information

Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany.

Center for Information and Language Processing (CIS), Ludwig Maximilian University Munich, München 80539, Germany.

Publication information

Bioinformatics. 2024 Oct 1;40(10). doi: 10.1093/bioinformatics/btae564.

Abstract

MOTIVATION

With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization, i.e. grounding them in a knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied "in the wild," i.e. on application-dependent text collections that are moderately to extremely different from those used for training, varying, e.g. in focus, genre or text type. This raises the question of whether the reported performance, usually obtained by training and evaluating on different partitions of the same corpus, can be trusted for downstream applications.
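To make the normalization step concrete, here is a minimal sketch of how grounding mentions in a knowledge base lets the same entity be aggregated across documents. It is not how HunFlair2 or any evaluated tool works: the lookup table `TOY_KB`, the helper names and the example documents are assumptions made purely for illustration, and real systems use far richer matching than an exact dictionary lookup.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

# Toy lookup table (an assumption for illustration): lower-cased surface
# form -> knowledge base identifier.
TOY_KB: Dict[str, str] = {
    "brca1": "NCBIGene:672",
    "breast cancer": "MESH:D001943",
    "breast carcinoma": "MESH:D001943",
}


def normalize(mention: str) -> Optional[str]:
    """Map a mention to an identifier, or None if it is not in the toy table."""
    key = " ".join(mention.lower().split())  # lower-case, collapse whitespace
    return TOY_KB.get(key)


def aggregate(mentions_by_doc: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Collect, for every identifier, the documents that mention it."""
    docs_by_id: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, mentions in mentions_by_doc.items():
        for mention in mentions:
            identifier = normalize(mention)
            if identifier is not None:
                docs_by_id[identifier].add(doc_id)
    return dict(docs_by_id)


# Hypothetical documents: different surface forms, same disease.
mentions_by_doc = {
    "doc1": ["BRCA1", "breast cancer"],
    "doc2": ["Breast  carcinoma", "unknown-term"],
}
index = aggregate(mentions_by_doc)
print(index)
```

Because "breast cancer" and "Breast  carcinoma" resolve to the same identifier, both documents end up under one key, which is what makes aggregation across documents possible.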

RESULTS

Here, we report on the results of a carefully designed cross-corpus benchmark for entity recognition and normalization, where tools were applied systematically to corpora not used during their training. From a survey of 28 published systems, we selected five according to predefined criteria such as feature richness and availability, for an in-depth analysis on three publicly available corpora covering four entity types. Our results present a mixed picture and show that cross-corpus performance is significantly lower than in-corpus performance. HunFlair2, the redesigned and extended successor of the HunFlair tool, showed the best performance on average, closely followed by PubTator Central. Our results indicate that users of BTM tools should expect lower performance than originally published when applying tools "in the wild" and show that further research is necessary for more robust BTM tools.
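As a rough illustration of how such a benchmark scores a tool, the sketch below computes mention-level precision, recall and F1 by comparing predicted entity spans against gold annotations. The exact-match criterion, the `Mention` type and the toy annotations are assumptions for illustration only. The abstract does not specify the matching rules used in the study, and the numbers below are not results from the paper.

```python
from typing import NamedTuple, Set, Tuple


class Mention(NamedTuple):
    """An entity mention identified by document, character span and type."""
    doc_id: str
    start: int
    end: int
    entity_type: str


def precision_recall_f1(gold: Set[Mention], predicted: Set[Mention]) -> Tuple[float, float, float]:
    """Score predictions against gold mentions using exact span and type match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up gold annotations of a corpus the tool was NOT trained on.
gold = {
    Mention("doc1", 0, 5, "Gene"),
    Mention("doc1", 20, 32, "Disease"),
    Mention("doc2", 7, 14, "Gene"),
    Mention("doc2", 40, 47, "Disease"),
}
# Made-up tool output on the same documents.
predicted = {
    Mention("doc1", 0, 5, "Gene"),       # exact match
    Mention("doc1", 20, 30, "Disease"),  # boundary error, counts as wrong
    Mention("doc2", 7, 14, "Gene"),      # exact match
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In a cross-corpus setting the same function would be run once per tool and corpus pair, with the gold set always drawn from a corpus the tool did not see during training.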

AVAILABILITY AND IMPLEMENTATION

All our models are integrated into the Natural Language Processing (NLP) framework flair: https://github.com/flairNLP/flair. Code to reproduce our results is available at: https://github.com/hu-ner/hunflair2-experiments.
