School of Computer, University of South China, Hengyang 421001, China.
Center for Complex Networks and Systems Research, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408, USA.
Int J Environ Res Public Health. 2020 Apr 14;17(8):2687. doi: 10.3390/ijerph17082687.
Electronic medical records are an integral part of medical texts, and entity recognition in these records has motivated many studies and many proposed extraction methods. In this paper, we propose a model for extracting entities from Chinese Electronic Medical Records (CEMR). In the input layer of the model, we use word embeddings and dictionary-feature embeddings as input vectors, where the word embedding consists of a character representation and a word representation. The input vectors are then fed to a bidirectional long short-term memory (Bi-LSTM) network to capture contextual features. Finally, a conditional random field (CRF) is employed to capture dependencies between neighboring tags. On a body classification task, the model reached an F1 value of 90.65%; on an anatomic region recognition task, it reached an F1 value of 93.89%. On both tasks, our model outperformed state-of-the-art models such as Bi-LSTM-CRF, Bi-LSTM-Attention, and Vote. The experiments also show that our model handles low-frequency entities and unknown entities well; with a small training dataset, our method showed a 2-4% improvement in F1 value over the basic Bi-LSTM-CRF models. Additionally, on the anatomic region recognition task, we combined the proposed entity extraction model with 12 rules we designed and a domain dictionary; with this combination, the weighted F1 value for extracting three specific entities reached 84.36%.
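To illustrate how a conditional random field layer can capture dependencies between neighboring tags, the following is a minimal sketch of Viterbi decoding over per-character tag scores. It is not the authors' implementation: the function name `viterbi_decode`, the three-tag scheme (`B`, `I`, `O`), and all scores are assumptions chosen for illustration. In the model described above, the per-character scores would come from the Bi-LSTM and the transition scores would be learned during training.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the tag sequence with the highest total score.

    emissions[i][t] is the score of tag t at position i (hypothetical
    stand-in for Bi-LSTM outputs); transitions[(p, t)] is the score of
    moving from tag p to tag t (missing pairs default to 0.0).
    """
    if not emissions:
        return []

    # score[t]: best score of any path that ends in tag t at the current position
    score = {t: emissions[0][t] for t in tags}
    backpointers: List[Dict[str, str]] = []

    for emit in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            # Pick the previous tag that maximizes path score plus transition score
            best_prev = max(
                tags, key=lambda p: score[p] + transitions.get((p, cur), 0.0)
            )
            new_score[cur] = (
                score[best_prev]
                + transitions.get((best_prev, cur), 0.0)
                + emit[cur]
            )
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final tag to recover the full path
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Illustrative scores only (not taken from the paper)
tags = ["O", "B", "I"]
emissions = [
    {"O": 0.1, "B": 2.0, "I": 0.0},
    {"O": 1.0, "B": 0.0, "I": 0.9},
    {"O": 1.1, "B": 0.0, "I": 1.0},
]
transitions = {("O", "I"): -10.0, ("B", "I"): 1.0, ("I", "I"): 0.5}

print(viterbi_decode(emissions, transitions, tags))  # ['B', 'I', 'I']
print(viterbi_decode(emissions, {}, tags))           # ['B', 'O', 'O']
```

In this toy example, choosing the best tag independently at each position yields `B, O, O`, whereas the transition scores favor continuing the entity and yield `B, I, I`. This is the kind of neighboring-tag dependency that a CRF layer adds on top of per-position scores.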