HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools.

Affiliations

Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany.

Center for Information and Language Processing (CIS), Ludwig Maximilian University Munich, München 80539, Germany.

Publication information

Bioinformatics. 2024 Oct 1;40(10). doi: 10.1093/bioinformatics/btae564.

DOI:10.1093/bioinformatics/btae564
PMID:39302686
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11453098/
Abstract

MOTIVATION

With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization, i.e. grounding them in a knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied "in the wild," i.e. on application-dependent text collections that differ moderately to extremely from those used for training, e.g. in focus, genre, or text type. This raises the question of whether the reported performance, usually obtained by training and evaluating on different partitions of the same corpus, can be trusted for downstream applications.

RESULTS

Here, we report on the results of a carefully designed cross-corpus benchmark for entity recognition and normalization, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five, based on predefined criteria like feature richness and availability, for an in-depth analysis on three publicly available corpora covering four entity types. Our results present a mixed picture and show that cross-corpus performance is significantly lower than in-corpus performance. HunFlair2, the redesigned and extended successor of the HunFlair tool, showed the best performance on average, closely followed by PubTator Central. Our results indicate that users of BTM tools should expect lower performance than originally published when applying tools "in the wild" and show that further research is necessary for more robust BTM tools.
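
The comparison between in-corpus and cross-corpus performance can be illustrated with a minimal sketch. It assumes exact-match scoring of mentions by precision, recall, and F1; the abstract does not state which metric or matching criterion the benchmark uses, so this is an assumption. The function name `prf1`, the entity type labels, and all toy mentions are hypothetical and invented for illustration; they are not data or results from the paper.

```python
from typing import Dict, Set, Tuple

# A mention is (document id, start offset, end offset, entity type).
# This representation is an assumption made for the sketch.
Mention = Tuple[str, int, int, str]


def prf1(gold: Set[Mention], predicted: Set[Mention]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over two sets of mentions."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy annotations, invented for illustration only (not from the paper).
# "In-corpus": the tool is evaluated on a held-out split of its own corpus.
gold_in = {("d1", 0, 7, "Disease"), ("d1", 20, 25, "Gene")}
pred_in = {("d1", 0, 7, "Disease"), ("d1", 20, 25, "Gene")}

# "Cross-corpus": the same tool is evaluated on a corpus it never saw.
gold_cross = {
    ("d2", 3, 9, "Disease"),
    ("d2", 15, 19, "Gene"),
    ("d2", 30, 38, "Chemical"),
    ("d2", 40, 45, "Species"),
}
pred_cross = {
    ("d2", 3, 9, "Disease"),
    ("d2", 15, 20, "Gene"),      # boundary off by one: no exact match
    ("d2", 30, 38, "Chemical"),  # the Species mention is missed entirely
}

in_corpus = prf1(gold_in, pred_in)
cross_corpus = prf1(gold_cross, pred_cross)
print(f"in-corpus F1:    {in_corpus['f1']:.3f}")
print(f"cross-corpus F1: {cross_corpus['f1']:.3f}")
```

Under exact matching, a boundary error counts as both a false positive and a false negative, which is one way differing annotation guidelines between corpora could lower cross-corpus scores.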

AVAILABILITY AND IMPLEMENTATION

All our models are integrated into the Natural Language Processing (NLP) framework flair: https://github.com/flairNLP/flair. Code to reproduce our results is available at: https://github.com/hu-ner/hunflair2-experiments.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4709/11453098/873554d01d18/btae564f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4709/11453098/76972ec464fd/btae564f1.jpg
