Zhang Zhichang, Zhu Lin, Yu Peilin
College of Computer Science and Engineering, Northwest Normal University, Lanzhou, China.
JMIR Med Inform. 2020 May 4;8(5):e17637. doi: 10.2196/17637.
Medical entity recognition is a key technology supporting the development of smart medicine. Methods for English medical entity recognition have advanced considerably, but progress in Chinese has been slow. Constrained by the complexity of the Chinese language and the limited annotated corpora, existing methods rely on simple neural networks that can neither effectively extract the deep semantic representations of electronic medical records (EMRs) nor be applied to scarce medical corpora. We therefore developed a new Chinese EMR (CEMR) dataset with six types of entities and proposed a multi-level representation learning model based on Bidirectional Encoder Representations from Transformers (BERT) for Chinese medical entity recognition.
This study aimed to improve the performance of the language model by having it learn multi-level representations, and to use it to recognize Chinese medical entities.
In this paper, we investigated the pretrained language representation model and found that utilizing information not only from the final layer but also from intermediate layers affects performance on the Chinese medical entity recognition task. We therefore proposed a multi-level representation learning model for entity recognition in Chinese EMRs. Specifically, we first used the BERT language model to extract semantic representations. Then, a multi-head attention mechanism was leveraged to automatically extract deeper semantic information from each layer. Finally, the representations produced by this multi-level extraction were used as the final semantic context embedding for each token, and a softmax layer predicted the entity tags.
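The fusion step described above, attending over every BERT layer's hidden state rather than using only the final layer, can be illustrated with a much-simplified sketch. This is a single-head, NumPy-only toy (random weights, illustrative dimensions), not the paper's actual multi-head architecture or trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_level_representation(layer_outputs, w_query):
    """Fuse per-layer token representations with attention.

    layer_outputs: (num_layers, seq_len, hidden) hidden states from
        every encoder layer, not just the last one.
    w_query: (hidden,) query vector (random here, learned in practice).
    Returns (seq_len, hidden): one fused embedding per token.
    """
    scores = layer_outputs @ w_query           # (num_layers, seq_len)
    weights = softmax(scores, axis=0)          # normalize over layers
    # weighted sum over layers -> fused context embedding per token
    return (weights[..., None] * layer_outputs).sum(axis=0)

def predict_tags(embeddings, w_out):
    """Project fused embeddings to entity-tag probabilities."""
    return softmax(embeddings @ w_out, axis=-1)

rng = np.random.default_rng(0)
L, T, H, K = 12, 8, 16, 13   # layers, tokens, hidden size, tag count
layers = rng.normal(size=(L, T, H))            # stand-in BERT outputs
fused = multi_level_representation(layers, rng.normal(size=H))
tags = predict_tags(fused, rng.normal(size=(H, K)))
print(fused.shape, tags.shape)                 # (8, 16) (8, 13)
```

In the real model, the layer outputs come from BERT, the attention is multi-head with learned projections, and the tag set follows the BIO scheme of the annotated corpus; the sketch only shows how layer-wise weighting yields one context embedding per token before softmax tagging.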
The best F1 score reached in our experiments was 82.11% on the CEMR dataset, and the F1 score increased further to 83.18% on the CCKS (China Conference on Knowledge Graph and Semantic Computing) 2018 benchmark dataset. Extensive comparative experiments showed that our proposed method outperforms methods from previous work, establishing a new state of the art.
We propose the multi-level representation learning model as a method for the Chinese EMR entity recognition task. Experiments on two clinical datasets demonstrate the usefulness of using a multi-head attention mechanism to extract multi-level representations as part of the language model.