Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom.
Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea.
J Biomed Inform. 2024 Nov;159:104731. doi: 10.1016/j.jbi.2024.104731. Epub 2024 Oct 4.
Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity across different biomedical corpora. We aim to tackle these challenges through transfer learning from easily accessible resources that have fewer concept overlaps with biomedical datasets.
We propose GERBERA, a simple yet effective method that uses general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model on both the target BioNER dataset and a general-domain dataset. Subsequently, we fine-tuned the models specifically on the BioNER dataset.
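The two-stage recipe above (multi-task training on a general-domain dataset plus the target BioNER dataset, followed by BioNER-only fine-tuning) can be sketched in miniature. This is a minimal illustrative sketch, not the authors' implementation: a toy linear tagger with one-hot token features stands in for the pre-trained biomedical language model, the synthetic data stands in for the general-domain and BioNER corpora, and all sizes and hyperparameters are arbitrary. The key structural point it shows is a shared encoder with separate per-dataset classification heads, so the two label sets never mix.

```python
import numpy as np

rng = np.random.default_rng(0)

V, H = 6, 8          # toy vocabulary size and hidden size (illustrative)
C_GEN, C_BIO = 3, 4  # tag-set sizes for the general and biomedical tasks

# Shared "encoder" (stands in for the pre-trained biomedical LM) plus one
# classification head per dataset, keeping the two label sets separate.
W_shared = rng.normal(0, 0.1, (V, H))
heads = {"gen": rng.normal(0, 0.1, (H, C_GEN)),
         "bio": rng.normal(0, 0.1, (H, C_BIO))}

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def step(X, y, task, lr=0.5):
    """One cross-entropy gradient step on the shared encoder + one task head."""
    global W_shared
    Hrep = X @ W_shared              # (N, H) token representations
    P = softmax(Hrep @ heads[task])  # (N, C) tag probabilities
    N = len(y)
    loss = -np.log(P[np.arange(N), y] + 1e-12).mean()
    dZ = P.copy(); dZ[np.arange(N), y] -= 1; dZ /= N
    gH = Hrep.T @ dZ                      # gradient w.r.t. the task head
    gS = X.T @ (dZ @ heads[task].T)       # gradient w.r.t. the shared encoder
    heads[task] -= lr * gH
    W_shared -= lr * gS
    return loss

def make_data(n, n_tags, seed):
    """Synthetic token-tagging data: one-hot tokens, deterministic toy labels."""
    r = np.random.default_rng(seed)
    ids = r.integers(0, V, n)
    return np.eye(V)[ids], ids % n_tags

Xg, yg = make_data(64, C_GEN, 1)  # stand-in for a general-domain NER corpus
Xb, yb = make_data(64, C_BIO, 2)  # stand-in for the target BioNER corpus

loss0 = step(Xb, yb, "bio", lr=0.0)  # initial BioNER loss (lr=0: no update)

# Stage 1: multi-task training, alternating batches from the two datasets.
for _ in range(200):
    step(Xg, yg, "gen")
    step(Xb, yb, "bio")

# Stage 2: fine-tune on the BioNER data alone.
for _ in range(100):
    loss = step(Xb, yb, "bio")

print(f"BioNER loss: {loss0:.3f} -> {loss:.3f}")
```

In this sketch the general-domain task only influences the biomedical task through the shared encoder weights, which mirrors the intuition stated in the abstract: the auxiliary dataset shapes the shared representation without forcing the two tag sets into one label space.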
We systematically evaluated GERBERA on five datasets covering eight entity types, collectively comprising 81,410 instances. Despite using fewer biomedical resources, our models outperformed baseline models trained with additional BioNER datasets. Specifically, our models consistently surpassed the baselines on six of the eight entity types, achieving an average improvement of 0.9% over the best baseline performance across the eight entity types. Our method was especially effective at improving performance on BioNER datasets with limited data, yielding a 4.7% improvement in F1 score on the JNLPBA-RNA dataset.
This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make the data, code, and models publicly available at https://github.com/qingyu-qc/bioner_gerbera.