Department of Mathematics, Hainan University, Haikou 570228, China.
Department of Data Science and Big Data Technology, Hainan University, Haikou 570228, China.
J Biomed Inform. 2024 Nov;159:104739. doi: 10.1016/j.jbi.2024.104739. Epub 2024 Oct 25.
Although deep learning techniques have achieved significant results, they often depend on large amounts of hand-labeled data and tend to perform poorly in few-shot scenarios. The objective of this study is to devise a strategy that improves a model's ability to recognize biomedical entities in few-shot learning settings.
By reformulating biomedical named entity recognition (BioNER) as a machine reading comprehension (MRC) problem, we propose a demonstration-based learning method for few-shot BioNER that involves constructing appropriate task demonstrations. We compared the proposed method with existing advanced methods on six benchmark datasets: BC4CHEMD, BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, BC2GM, and JNLPBA.
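To make the MRC reformulation with task demonstrations concrete, the following is a minimal sketch of how such an input might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the `Demo` class, the `build_mrc_input` function, the query wording, the "Entities:" label, and the `[SEP]` separator are all hypothetical choices that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    """A hypothetical task demonstration: a labeled example sentence."""
    sentence: str
    entities: List[str]


def build_mrc_input(query: str, demos: List[Demo], sentence: str,
                    sep: str = " [SEP] ") -> str:
    """Join a query, labeled demonstrations, and the target sentence.

    The template is an assumption made for illustration; the abstract
    does not describe the exact format used in the study.
    """
    parts = [query]
    for demo in demos:
        # Show the gold entities next to each demonstration sentence.
        answer = ", ".join(demo.entities) if demo.entities else "none"
        parts.append(f"{demo.sentence} Entities: {answer}")
    # The sentence to be tagged comes last, after all demonstrations.
    parts.append(sentence)
    return sep.join(parts)


demo = Demo("Aspirin reduces fever.", ["Aspirin"])
text = build_mrc_input(
    "Find all chemical entities in the text.",
    [demo],
    "Ibuprofen relieves pain.",
)
print(text)
```

In a setup like this, an MRC model would receive the assembled string and predict the answer spans within the final sentence, whereas a sequence labeling model would assign a tag to every token.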
We evaluated the models by reporting F1 scores from both the 25-shot and 50-shot learning experiments. In 25-shot learning, we observed a 1.1% improvement in the average F1 score over the baseline method, reaching 61.7%, 84.1%, 69.1%, 70.1%, 50.6%, and 59.9% on the six datasets, respectively. In 50-shot learning, we improved the average F1 score by 1.0% over the baseline method, reaching 73.1%, 86.8%, 76.1%, 75.6%, 61.7%, and 65.4%, respectively.
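As a quick check on the reported figures, the sketch below computes the unweighted mean of the per-dataset F1 scores. It assumes that "respectively" follows the order in which the six datasets are listed in the methods, and that the average is a simple unweighted mean; neither is stated explicitly in the abstract, and the variable and function names are illustrative.

```python
# Dataset order is an assumption: taken from the order in which the
# benchmarks are listed in the abstract.
datasets = [
    "BC4CHEMD", "BC5CDR-Chemical", "BC5CDR-Disease",
    "NCBI-Disease", "BC2GM", "JNLPBA",
]

# Per-dataset F1 scores (%) as reported in the abstract.
f1_25_shot = dict(zip(datasets, [61.7, 84.1, 69.1, 70.1, 50.6, 59.9]))
f1_50_shot = dict(zip(datasets, [73.1, 86.8, 76.1, 75.6, 61.7, 65.4]))


def mean_f1(scores: dict) -> float:
    """Unweighted mean over datasets (an assumed averaging scheme)."""
    return sum(scores.values()) / len(scores)


avg_25 = mean_f1(f1_25_shot)
avg_50 = mean_f1(f1_50_shot)
print(f"25-shot average F1: {avg_25:.1f}%")
print(f"50-shot average F1: {avg_50:.1f}%")
```

Under these assumptions the averages come to roughly 65.9% for 25-shot and 73.1% for 50-shot learning.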
We found that, in few-shot BioNER, MRC-based language models are much more proficient at recognizing biomedical entities than the sequence labeling approach. Furthermore, our MRC-based language models can compete with fully supervised learning methods that rely heavily on abundant annotated data. These results point to possible directions for future advances in few-shot BioNER.