School of Information Management, Sun Yat-Sen University, Guangzhou, 510006, China.
School of Artificial Intelligence, Sun Yat-Sen University, Zhuhai, 519082, China.
BMC Med Inform Decis Mak. 2024 Aug 5;24(1):221. doi: 10.1186/s12911-024-02624-x.
Data augmentation is crucial in medical named entity recognition (NER) because of the challenges particular to this field. Medical data is characterized by high acquisition costs, specialized terminology, imbalanced distributions, and limited training resources. These factors make achieving high performance in medical NER particularly difficult. Data augmentation methods help to mitigate these issues by generating additional training samples, thus balancing data distribution, enriching the training dataset, and improving model generalization. This paper proposes two data augmentation methods, Contextual Random Replacement based on Word2Vec Augmentation (CRR) and Targeted Entity Random Replacement Augmentation (TER), aimed at addressing the scarcity and imbalance of data in the medical domain. When combined with a deep learning-based Chinese NER model, these methods can significantly enhance performance and recognition accuracy under limited resources. Experimental results demonstrate that both augmentation methods effectively improve the recognition of medical named entities. Specifically, the BERT-BiLSTM-CRF model achieved the highest F1 score of 83.587%, representing a 1.49% increase over the baseline model. This validates the importance and effectiveness of data augmentation in medical NER.
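The abstract does not spell out how TER works, so the following is only a minimal sketch of one plausible reading of entity-level random replacement on BIO-tagged, character-level data: mentions of selected entity types are swapped for other mentions of the same type drawn from the training set, and the tags are regenerated to match. The function names (`extract_entities`, `build_lexicon`, `replace_entities`), the replacement probability `p`, the `DIS` label, and the toy corpus are all assumptions made for illustration and are not taken from the paper.

```python
import random
from collections import defaultdict


def extract_entities(tokens, tags):
    """Return (start, end, type) spans from BIO tags; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def build_lexicon(corpus):
    """Collect the distinct mentions seen for each entity type."""
    lexicon = defaultdict(list)
    for tokens, tags in corpus:
        for s, e, t in extract_entities(tokens, tags):
            mention = tokens[s:e]
            if mention not in lexicon[t]:
                lexicon[t].append(mention)
    return lexicon


def replace_entities(tokens, tags, lexicon, target_types, p=0.5, rng=random):
    """Swap each targeted entity, with probability p, for another mention of the same type."""
    new_tokens, new_tags, cursor = [], [], 0
    for s, e, t in extract_entities(tokens, tags):
        # Copy the untouched context before this entity.
        new_tokens += tokens[cursor:s]
        new_tags += tags[cursor:s]
        mention = tokens[s:e]
        if t in target_types and rng.random() < p:
            candidates = [m for m in lexicon.get(t, []) if m != mention]
            if candidates:
                mention = rng.choice(candidates)
        # Regenerate BIO tags so they match the (possibly new) mention length.
        new_tokens += mention
        new_tags += ["B-" + t] + ["I-" + t] * (len(mention) - 1)
        cursor = e
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Hypothetical toy corpus: character-level tokens with a made-up DIS (disease) label.
corpus = [
    (list("患者有糖尿病"), ["O", "O", "O", "B-DIS", "I-DIS", "I-DIS"]),
    (list("诊断为肺炎"), ["O", "O", "O", "B-DIS", "I-DIS"]),
]
lexicon = build_lexicon(corpus)
augmented = [
    replace_entities(toks, tags, lexicon, {"DIS"}, p=1.0, rng=random.Random(0))
    for toks, tags in corpus
]
```

Restricting replacement to `target_types` is what would let such a scheme oversample under-represented entity classes, which is the imbalance problem the abstract describes. Whether the paper selects targets this way is not stated here.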