Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing, China.
National Engineering Laboratory for Internet Medical Systems and Applications, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China.
Math Biosci Eng. 2023 Jan;20(1):1018-1036. doi: 10.3934/mbe.2023047. Epub 2022 Oct 24.
Medical procedure entity normalization is an important task for sharing medical information at the semantic level. In real-world practice, its main challenges are the variety and similarity of procedure terms. Although deep learning-based methods have been successfully applied to biomedical entity normalization, they often depend on traditional context-independent word embeddings, and there is minimal research on medical entity recognition in Chinese. Treating entity normalization as a sentence-pair classification task, we applied a three-step framework to normalize Chinese medical procedure terms: dataset construction, candidate concept generation, and candidate concept ranking. For dataset construction, an external knowledge base and easy data augmentation techniques were used to increase the diversity of training samples. For candidate concept generation, we implemented BM25 retrieval that integrates synonym knowledge from SNOMED CT and the training data. For candidate concept ranking, we designed a stacking-BERT model that combines the original BERT-based and Siamese-BERT ranking models to capture semantic information and selects the optimal mapping pairs through a stacking mechanism. During training, we also used adversarial training to improve the model's ability to learn from small-scale training data. On the clinical entity normalization task dataset of the 5th China Health Information Processing Conference, our stacking-BERT model achieved an accuracy of 93.1%, outperforming the single BERT models and other traditional deep learning models. In conclusion, this paper presents an effective method for Chinese medical procedure entity normalization and validates different BERT-based models.
In addition, we found that adversarial training and data augmentation can effectively improve the performance of deep learning models on small samples, which might provide useful ideas for future research.
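The candidate concept generation step can be illustrated with a minimal sketch. The abstract does not specify the BM25 variant, the tokenization, or how the index was built, so everything here is an assumption: the Okapi BM25 formula with k1 = 1.5 and b = 0.75, character-unigram tokenization (which avoids Chinese word segmentation), the hypothetical concept IDs C1 to C3, and the toy procedure names. Each concept is indexed under its standard name and its synonyms, and a concept's score is the best score among its aliases.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: character unigrams, which avoids Chinese word segmentation.
    return [ch for ch in text if not ch.isspace()]


class BM25Index:
    """Okapi BM25 over (concept_id, alias) pairs; a sketch, not the authors' code."""

    def __init__(self, aliases, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.concept_ids = [cid for cid, _ in aliases]
        self.docs = [Counter(tokenize(name)) for _, name in aliases]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency of each token, then a smoothed, non-negative IDF.
        df = Counter()
        for doc in self.docs:
            df.update(doc.keys())
        n = len(self.docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query_tokens, i):
        doc, length = self.docs[i], self.lengths[i]
        total = 0.0
        for t in query_tokens:
            tf = doc.get(t, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[t] * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, mention, k=10):
        query_tokens = tokenize(mention)
        best = {}  # concept_id -> best score over all of its aliases
        for i, cid in enumerate(self.concept_ids):
            s = self.score(query_tokens, i)
            if s > best.get(cid, 0.0):
                best[cid] = s
        return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical toy index: standard names plus one synonym for C1.
ALIASES = [
    ("C1", "阑尾切除术"),        # appendectomy (standard name)
    ("C1", "阑尾摘除术"),        # appendectomy (synonym)
    ("C2", "腹腔镜阑尾切除术"),  # laparoscopic appendectomy
    ("C3", "胆囊切除术"),        # cholecystectomy
]

index = BM25Index(ALIASES)
candidates = index.top_k("腹腔镜下阑尾切除", k=2)
for concept_id, bm25_score in candidates:
    print(concept_id, round(bm25_score, 3))
```

In a pipeline like the one described, the returned candidates would then be paired with the original mention and passed to the ranking model. Taking the maximum over aliases is one simple way to let synonym knowledge raise recall; other aggregation schemes are possible.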