Jiale Nan, Gao Dongping, Sun Yuanyuan, Li Xiaoying, Shen Xifeng, Li Meiting, Zhang Weining, Ren Huiling, Qin Yi
Institute of Medical Information, Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, 100020, China.
Heliyon. 2022 Oct 29;8(11):e11291. doi: 10.1016/j.heliyon.2022.e11291. eCollection 2022 Nov.
With the rapid development of medical diagnosis and treatment technologies, novel and complicated concepts and usages of clinical terms, especially those for surgical procedures, have become common in daily clinical practice. Surgical procedures are expected to be performed in an operating room and to involve an incision based on expert discretion, and they imply a clinical understanding of diagnosis, examination, testing, equipment, drugs, symptoms, and so on. Terms expressing surgical procedures are nevertheless difficult to recognize, because they are highly distinctive: they are morphologically long and exhibit complex linguistic phenomena. To achieve higher recognition performance and to overcome the absence of natural delimiters in Chinese sentences, we propose a Named Entity Recognition (NER) model named Structural-SoftLexicon-Bi-LSTM-CRF (SSBC), empowered by the pre-trained model BERT. In particular, we pre-trained a lexicon embedding on a large-scale medical corpus to better leverage domain-specific structural knowledge. With the input additionally augmented by BERT, rich multigranular information and structural term information are transferred from Structural-SoftLexicon to the downstream Bi-LSTM-CRF model, which yields a globally optimal prediction for the input sequence. We evaluate our model on a self-built corpus, and the results show that SSBC with the pre-trained model outperforms other state-of-the-art benchmarks, by up to 3.77% in F1 score. We hope this study will benefit Diagnosis Related Groups (DRGs) and Diagnosis Intervention Package (DIP) grouping systems, medical records statistics and analysis, Medicare payment systems, and similar applications.
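To make the lexicon idea concrete, the following is a minimal sketch of the generic SoftLexicon matching step that the model name refers to: for each character, every lexicon word covering it is placed in a B (begin), M (middle), E (end), or S (single) set, so word-boundary cues are available without segmenting the sentence first. It is not the authors' implementation and does not cover the structural extension or the pre-trained lexicon embedding. The function name `soft_lexicon_sets`, the `max_len` parameter, and the tiny lexicon are assumptions made only for illustration.

```python
from typing import Dict, List, Set


def soft_lexicon_sets(
    sentence: str, lexicon: Set[str], max_len: int = 8
) -> List[Dict[str, Set[str]]]:
    """For each character, collect the lexicon words in which it appears
    as the Beginning, Middle, or End character, or as a Single-char word."""
    n = len(sentence)
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in range(n)]
    for i in range(n):
        # Try every substring starting at i, up to max_len characters long.
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = sentence[i:j]
            if word not in lexicon:
                continue
            if j - i == 1:
                sets[i]["S"].add(word)
            else:
                sets[i]["B"].add(word)
                sets[j - 1]["E"].add(word)
                for k in range(i + 1, j - 1):
                    sets[k]["M"].add(word)
    return sets


# Hypothetical toy lexicon; a real one would come from a medical corpus.
lexicon = {"腹腔", "腹腔镜", "胆囊", "切除", "切除术", "胆囊切除术"}
sentence = "腹腔镜胆囊切除术"  # "laparoscopic cholecystectomy"
sets = soft_lexicon_sets(sentence, lexicon)

for ch, s in zip(sentence, sets):
    print(ch, {tag: sorted(words) for tag, words in s.items()})
```

In a full model, each of the four sets would be pooled into a fixed-size vector from word embeddings and concatenated with the character representation (here, the BERT output) before being passed to the Bi-LSTM-CRF; that pooling step is omitted from this sketch.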