Li Yongbin, Wang Xiaohua, Hui Linhu, Zou Liping, Li Hongjin, Xu Luo, Liu Weihai
School of Medical Information Engineering, Zunyi Medical University, Zunyi, China.
Radiology Department, Beilun District People's Hospital, Ningbo, China.
JMIR Med Inform. 2020 Sep 4;8(9):e19848. doi: 10.2196/19848.
Clinical named entity recognition (CNER), which aims to automatically identify clinical entities in electronic medical records (EMRs), is an important research direction in clinical text data mining and information extraction. Advances in CNER can support clinical decision making and the construction of medical knowledge bases, which could in turn improve the overall quality of medical care. Because Chinese word segmentation and grammar are complex, Chinese CNER was developed later than English CNER and is more challenging.
With the development of distributed representations and deep learning, a series of models have been applied to Chinese CNER. Unlike English CNER, Chinese CNER methods are mainly either character-based or word-based; neither kind makes comprehensive use of the information in EMRs, and neither resolves the ambiguity of word representations.
In this paper, we propose a lattice long short-term memory (LSTM) model combined with a variant contextualized character representation and a conditional random field (CRF) layer for Chinese CNER: the Embeddings from Language Models (ELMo)-lattice-LSTM-CRF model. The lattice LSTM model can effectively utilize the information from characters and words in Chinese EMRs; in addition, the variant ELMo model uses Chinese characters as input instead of the character-encoding layer of the ELMo model, so as to learn domain-specific contextualized character embeddings.
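To illustrate how a lattice can expose both character and word information, the following minimal sketch collects every lexicon word that matches a span of a character sequence and indexes it by the character where it ends. It shows only this input-construction step, not the LSTM itself. The function name `build_lattice`, the `max_len` limit, and the toy lexicon are assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def build_lattice(
    chars: str, lexicon: Set[str], max_len: int = 4
) -> Dict[int, List[Tuple[int, str]]]:
    """Map each character index to the lexicon words that end at that index.

    Each entry is a (start_index, word) pair. Only multi-character words are
    considered, because single characters are already part of the character
    sequence. max_len is a hypothetical cap on the word length searched.
    """
    lattice: Dict[int, List[Tuple[int, str]]] = {i: [] for i in range(len(chars))}
    for start in range(len(chars)):
        # end is exclusive; words span at least two characters
        for end in range(start + 2, min(start + max_len, len(chars)) + 1):
            word = chars[start:end]
            if word in lexicon:
                lattice[end - 1].append((start, word))
    return lattice


# Toy example: "右肺上叶" (right upper lobe of the lung) with a hypothetical lexicon.
toy_lexicon = {"右肺", "上叶", "右肺上叶"}
toy_lattice = build_lattice("右肺上叶", toy_lexicon)
for end_index, words in toy_lattice.items():
    print(end_index, words)
```

In this toy case the last character receives two word paths, "右肺上叶" and "上叶", alongside the plain character path. A lattice model can weigh such competing paths rather than commit to a single segmentation.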
We evaluated our method using two Chinese CNER datasets from the China Conference on Knowledge Graph and Semantic Computing (CCKS): the CCKS-2017 CNER dataset and the CCKS-2019 CNER dataset. We obtained F1 scores of 90.13% and 85.02% on the test sets of these two datasets, respectively.
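For readers unfamiliar with how an F1 score is computed for entity recognition, the sketch below scores predicted tag sequences against gold tags at the entity level. It assumes BIO tagging and exact-match scoring, where an entity counts as correct only if its boundaries and type both match. The abstract does not state the tagging scheme or matching criterion used, and the label names `ANAT` and `DIS` are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match entity-level F1 between gold and predicted BIO tags."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_tags = ["B-ANAT", "I-ANAT", "O", "B-DIS", "I-DIS"]
pred_tags = ["B-ANAT", "I-ANAT", "O", "B-DIS", "O"]
print(entity_f1(gold_tags, pred_tags))
```

Here the prediction recovers one of two gold entities exactly and truncates the other, so precision and recall are both 0.5.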
Our results show that the proposed method is effective for Chinese CNER. The experiments also show that the variant contextualized character representations can significantly improve model performance.