Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, Novi Sad, 21000, Serbia.
Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, Novi Sad, 21000, Serbia; Bayer A.G., Research and Development, Mullerstrasse 173, Berlin, 13342, Germany.
Artif Intell Med. 2024 Oct;156:102970. doi: 10.1016/j.artmed.2024.102970. Epub 2024 Aug 24.
Supervised named entity recognition (NER) in the biomedical domain depends on large sets of texts annotated with the given named entities. Creating such datasets can be time-consuming and expensive, and extracting new entities requires additional annotation and retraining of the model. This paper proposes a method for zero- and few-shot NER in the biomedical domain to address these challenges. The method transforms the task of multi-class token classification into binary token classification and pre-trains on a large number of datasets and biomedical entities, which allows the model to learn semantic relations between the given and potentially novel named entity labels. We achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER across 9 diverse biomedical entities using a fine-tuned PubMedBERT-based model. The results demonstrate the effectiveness of the proposed method for recognizing new biomedical entities from no examples or only a limited number of them: it outperforms previous transformer-based methods and is comparable to GPT-3-based models while using models with over 1000 times fewer parameters. We make the models and the developed code publicly available.
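The recasting of multi-class token classification as binary token classification can be illustrated with a minimal sketch. It assumes one plausible reading of the abstract: each sentence is paired with a single entity label, and every token is tagged 1 if it belongs to that label and 0 otherwise. The function name `to_binary_examples`, the tag format, and the example sentence are hypothetical and are not taken from the authors' released code.

```python
from typing import List, Tuple

# One binary example: (queried entity label, tokens, per-token 0/1 tags).
BinaryExample = Tuple[str, List[str], List[int]]


def to_binary_examples(
    tokens: List[str],
    tags: List[str],
    entity_labels: List[str],
) -> List[BinaryExample]:
    """Split one multi-class tagged sentence into one binary task per label.

    Assumption (not from the paper): `tags` holds one plain class name per
    token, with "O" marking tokens outside any entity.
    """
    examples: List[BinaryExample] = []
    for label in entity_labels:
        # A token is positive only for the label currently being queried.
        binary_tags = [1 if tag == label else 0 for tag in tags]
        examples.append((label, tokens, binary_tags))
    return examples


# Hypothetical sentence with two annotated entity classes.
tokens = ["Aspirin", "reduces", "fever", "in", "adults"]
tags = ["Drug", "O", "Disease", "O", "O"]

# "Gene" has no annotated tokens here, so its binary tags are all zero.
examples = to_binary_examples(tokens, tags, ["Drug", "Disease", "Gene"])
for label, toks, binary_tags in examples:
    print(label, list(zip(toks, binary_tags)))
```

Under this framing the label itself becomes part of the model input, so a label never seen in training can still be queried at inference time, which is what makes zero-shot use conceivable. How the label is actually encoded alongside the sentence is not specified in the abstract.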