Yu Yin, Hyunjae Kim, Xiao Xiao, Chih Hsuan Wei, Jaewoo Kang, Zhiyong Lu, Hua Xu, Meng Fang, Qingyu Chen
arXiv preprint arXiv:2406.10671v4, 2024 Dec 30.
Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotation. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity across different biomedical corpora. We aim to tackle these challenges through transfer learning from easily accessible resources that have less concept overlap with biomedical datasets. We propose GERBERA, a simple yet effective method that uses general-domain NER datasets for training. We perform multi-task learning to train a pre-trained biomedical language model on both the target BioNER dataset and the general-domain dataset, and subsequently fine-tune the model on the BioNER dataset alone. We systematically evaluated GERBERA on five datasets covering eight entity types, collectively comprising 81,410 instances. Despite using fewer biomedical resources, our models outperformed baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baselines on six of the eight entity types, achieving an average improvement of 0.9% over the best baseline performance across the eight entity types. Our method was especially effective for BioNER datasets with limited training data, yielding a 4.7% improvement in F1 score on the JNLPBA-RNA dataset. This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. The approach significantly improves BioNER performance, making it a valuable asset for scenarios in which biomedical datasets are scarce or costly.
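The two-stage recipe described above (multi-task training on the target BioNER dataset plus a general-domain dataset, then fine-tuning on the target dataset alone) can be sketched as a batch schedule. This is an illustrative sketch only, not the authors' released code; the function name `two_stage_schedule` and the batch representation are assumptions for the example.

```python
import random

def two_stage_schedule(bio_batches, general_batches,
                       stage1_epochs, stage2_epochs, seed=0):
    """Build a GERBERA-style training schedule (illustrative sketch).

    Stage 1 (multi-task): each epoch shuffles together batches from the
    target BioNER dataset and the general-domain NER dataset.
    Stage 2 (fine-tune): each epoch uses only the target BioNER batches.
    Returns a list of (stage, source, batch) tuples in training order.
    """
    rng = random.Random(seed)
    schedule = []
    for _ in range(stage1_epochs):
        mixed = ([("bio", b) for b in bio_batches]
                 + [("general", b) for b in general_batches])
        rng.shuffle(mixed)  # interleave the two corpora within the epoch
        schedule.extend(("multi-task", src, b) for src, b in mixed)
    for _ in range(stage2_epochs):
        schedule.extend(("fine-tune", "bio", b) for b in bio_batches)
    return schedule
```

In a real setup each batch would carry tokenized sentences and entity labels, and the training loop would route each batch to the corresponding task head of the shared pre-trained biomedical language model.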