Chinese medical named entity recognition integrating adversarial training and feature enhancement.

Author information

Zhang Xu, Kao Youchen, Che Shengbing, Yan Juan, Zhou Sha, Guo Shenyi, Wang Wanqin

Affiliations

College of Computer Science and Mathematics, Central South University of Forestry and Technology, No.498 Shaoshan South Road, Wenyuan Street, Changsha, 410004, Hunan, China.

Information and Engineering College, Swan College, Central South University of Forestry and Technology, No.1-10 Furong North Road, Changsha, 410211, Hunan, China.

Publication information

Sci Rep. 2025 Apr 28;15(1):14844. doi: 10.1038/s41598-025-98465-3.

DOI:10.1038/s41598-025-98465-3

PMID:40295595

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC12037839/

Abstract

Chinese has a distinctive character composition structure, and medical entities are often nested. Together these properties pose several challenges for medical named entity recognition in Chinese Electronic Health Records (EHRs), including scarce annotated data, strong tokenization ambiguity, and blurred entity boundaries, all of which make it harder to extract medical entities and assign their categories. To address this, the paper proposes a Chinese clinical named entity recognition model that integrates BERT and adversarial enhancement in a dual-channel architecture. First, the model combines Bidirectional Long Short-Term Memory networks (BiLSTM), Iterated Dilated Convolutional Neural Networks (IDCNN), and Conditional Random Fields (CRF) to improve recognition accuracy. Second, the authors collected texts from medical record websites and used the YEDDA tool to annotate and process them professionally, producing a more comprehensive target dataset. This ensures that the model is exposed to representative Chinese clinical data during training, thereby improving recognition performance. Finally, experimental results show that the BPBIC model achieved a precision of 93.80%, a recall of 94.44%, and an F1 score of 94.12% on the augmented CCKS2019 dataset (CCKS2019+). Moreover, through knowledge graph analysis of the medical entities extracted from single-disease and multi-disease EHRs, the model helps doctors reach rapid and accurate diagnoses, improving the efficiency of healthcare professionals.
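As a quick consistency check on the reported metrics, the F1 score can be recomputed from the stated precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the abstract does not say whether the scores are micro- or macro-averaged, and the helper name `f1_score` is purely illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for CCKS2019+, in percent.
precision, recall = 93.80, 94.44

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2f}%")  # prints "F1 = 94.12%"
```

The recomputed value matches the 94.12% given in the abstract, so the three reported figures are mutually consistent under this definition.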
