Chinese Clinical Named Entity Recognition With Segmentation Synonym Sentence Synthesis Mechanism: Algorithm Development and Validation.

Authors

Tang Jian, Huang Zikun, Xu Hongzhen, Zhang Hao, Huang Hailing, Tang Minqiong, Luo Pengsheng, Qin Dong

Affiliations

Department of Pharmacy, People's Hospital of Guilin, 12 Wenming Road, Guilin, 541000, China, 86 18978320258.

School of Science and Technology, Guilin University, Guilin, China.

Publication information

JMIR Med Inform. 2024 Nov 21;12:e60334. doi: 10.2196/60334.

Abstract

BACKGROUND

Clinical named entity recognition (CNER) is a fundamental task in natural language processing used to extract named entities from electronic medical record texts. In recent years, with the continuous development of machine learning, deep learning models have replaced traditional machine learning and template-based methods and are now widely applied in the CNER field. However, because clinical texts are complex, named entity types are numerous and diverse, and the boundaries between different entities are unclear, existing advanced methods still rely to some extent on annotated databases and on the scale of embedded dictionaries.

OBJECTIVE

This study aims to address the issues of data scarcity and labeling difficulties in CNER tasks by proposing a dataset augmentation algorithm based on proximity word calculation.

METHODS

We propose a Segmentation Synonym Sentence Synthesis (SSSS) algorithm based on neighboring vocabulary, which leverages existing public knowledge and requires no manual expansion of specialized domain dictionaries. After lexical segmentation, the algorithm substitutes synonymous words drawn from large-scale natural language data and recombines them into new sentences, expanding the dataset with expressions close to the originals. We applied the SSSS algorithm to the Robustly Optimized Bidirectional Encoder Representations from Transformers Pretraining Approach (RoBERTa) + conditional random field (CRF) and RoBERTa + Bidirectional Long Short-Term Memory (BiLSTM) + CRF models and evaluated our models (SSSS + RoBERTa + CRF; SSSS + RoBERTa + BiLSTM + CRF) on the China Conference on Knowledge Graph and Semantic Computing (CCKS) 2017 and 2019 datasets.
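The abstract does not give implementation details, so the following is only a minimal sketch of the segment, substitute, and resynthesize idea, not the authors' algorithm. It assumes the sentence is already segmented into words with word-level entity labels, that only non-entity words (label "O") are replaced so entity annotations stay valid, and that synonyms come from a small hand-written table. The names `SYNONYMS`, `augment`, and `to_char_bio`, and the `SYMPTOM` label, are hypothetical; in practice the synonym table might be built from nearest neighbours in word embeddings trained on general-domain text.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical synonym table (an assumption for illustration only).
SYNONYMS: Dict[str, List[str]] = {
    "出现": ["发生", "产生"],
    "明显": ["显著"],
}

# A segment is (word, entity type), with "O" marking non-entity words.
Segment = Tuple[str, str]


def augment(segments: List[Segment],
            synonyms: Dict[str, List[str]],
            rng: random.Random) -> List[Segment]:
    """Build a new sentence by swapping non-entity words for a synonym."""
    out: List[Segment] = []
    for word, label in segments:
        # Entity words are left untouched so their labels remain correct.
        if label == "O" and word in synonyms:
            word = rng.choice(synonyms[word])
        out.append((word, label))
    return out


def to_char_bio(segments: List[Segment]) -> Tuple[List[str], List[str]]:
    """Expand word-level segments into character-level BIO tags."""
    chars: List[str] = []
    tags: List[str] = []
    for word, label in segments:
        for i, ch in enumerate(word):
            chars.append(ch)
            if label == "O":
                tags.append("O")
            else:
                tags.append(("B-" if i == 0 else "I-") + label)
    return chars, tags


# Toy pre-segmented sentence: "患者 / 出现 / 明显 / 腹痛".
sentence: List[Segment] = [
    ("患者", "O"), ("出现", "O"), ("明显", "O"), ("腹痛", "SYMPTOM"),
]
new_sentence = augment(sentence, SYNONYMS, random.Random(0))
chars, tags = to_char_bio(new_sentence)
```

Because the tags are regenerated from the word-level labels after substitution, a synonym of a different length than the original word does not misalign the character-level annotation.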

RESULTS

Our experiments demonstrated that the models SSSS + RoBERTa + CRF and SSSS + RoBERTa + BiLSTM + CRF achieved F1-scores of 91.30% and 91.35% on the CCKS-2017 dataset, respectively. They also achieved F1-scores of 83.21% and 83.01% on the CCKS-2019 dataset, respectively.
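For readers unfamiliar with how F1-scores are typically computed for named entity recognition, the sketch below shows one common convention: an entity counts as correct only if its boundaries and type both match the gold annotation exactly. The abstract does not state which matching criterion was used, so strict matching is an assumption here, and the function names and labels are hypothetical.

```python
from typing import List, Set, Tuple

# A span is (start index, end index exclusive, entity type).
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel flushes a span that ends at the last tag.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def f1_score(gold: List[str], pred: List[str]) -> float:
    """Strict entity-level F1: harmonic mean of precision and recall."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)  # spans with identical boundaries and type
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-SYMPTOM", "I-SYMPTOM", "O", "B-DRUG", "I-DRUG"]
pred = ["B-SYMPTOM", "I-SYMPTOM", "O", "B-DRUG", "O"]
score = f1_score(gold, pred)
```

In this toy case the predicted DRUG entity has the wrong right boundary, so only one of two entities matches on each side, giving precision 0.5, recall 0.5, and F1 0.5.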

CONCLUSIONS

The experimental results indicated that our proposed method successfully expanded the dataset and remarkably improved the performance of the model, effectively addressing the challenges of data acquisition, annotation difficulties, and insufficient model generalization performance.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/81be/11612518/389d09ec39fb/medinform-v12-e60334-g001.jpg
