Wu Yonghui, Yang Xi, Bian Jiang, Guo Yi, Xu Hua, Hogan William
Departments of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA.
School of Biomedical Informatics, the University of Texas Health Science Center at Houston, Houston, Texas, USA.
AMIA Annu Symp Proc. 2018 Dec 5;2018:1110-1117. eCollection 2018.
There has been increasing interest in developing deep learning methods to recognize clinical concepts in narrative clinical text. Recently, several studies have reported that Recurrent Neural Networks (RNNs) outperformed traditional machine learning methods such as Conditional Random Fields (CRFs). Deep learning-based Named Entity Recognition (NER) systems often use statistical language models to learn word embeddings from unlabeled corpora. However, current word embedding methods struggle to learn good representations for low-frequency words. Medicine is a knowledge-intensive domain; existing medical knowledge has the potential to improve feature representations for infrequent yet important words. Yet it remains unclear how existing medical knowledge can help deep learning models in clinical NER tasks. In this study, we integrated medical knowledge from the Unified Medical Language System (UMLS) with word embeddings trained on an unlabeled clinical corpus, and used the combined representations in RNNs to detect problems, treatments, and lab tests. We examined three ways of generating medical knowledge features: a dictionary lookup program, the KnowledgeMap system, and the MedLEE system. We also compared representing medical knowledge as one-hot vectors versus as trainable embedding layers. The RNN with medical knowledge as embedding layers achieved new state-of-the-art performance on the 2010 i2b2 corpus (a strict F1 score of 86.21% and a relaxed F1 score of 92.80%), outperforming both an RNN with only word embeddings and RNNs with medical knowledge as one-hot vectors. This study demonstrates an efficient way of integrating medical knowledge with distributed word representations for clinical NER.
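The abstract's central design contrast, fixed one-hot knowledge vectors versus a trainable knowledge embedding layer concatenated to word embeddings before the RNN, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example, not the authors' implementation; every name and dimension here (KNOWLEDGE_TYPES, KNOW_DIM, the BiLSTM size, the tag count) is an illustrative assumption rather than a value from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB_SIZE = 20_000   # size of the word vocabulary (illustrative)
WORD_DIM = 100        # pre-trained word-embedding dimension (illustrative)
KNOWLEDGE_TYPES = 8   # e.g., knowledge categories produced by the lookup tools (assumed)
KNOW_DIM = 16         # learned knowledge-embedding dimension (illustrative)
NUM_TAGS = 7          # BIO tags for problem/treatment/test (illustrative)

class KnowledgeBiLSTMTagger(nn.Module):
    """Sketch of a BiLSTM tagger whose input concatenates word embeddings
    with a medical-knowledge feature, encoded either as a fixed one-hot
    vector or as a trainable embedding."""

    def __init__(self, use_embedding_layer: bool):
        super().__init__()
        self.use_embedding_layer = use_embedding_layer
        self.word_emb = nn.Embedding(VOCAB_SIZE, WORD_DIM)
        if use_embedding_layer:
            # (b) knowledge categories map to trainable dense vectors
            self.know_emb = nn.Embedding(KNOWLEDGE_TYPES, KNOW_DIM)
            feat_dim = WORD_DIM + KNOW_DIM
        else:
            # (a) knowledge categories stay as fixed one-hot vectors
            feat_dim = WORD_DIM + KNOWLEDGE_TYPES
        self.lstm = nn.LSTM(feat_dim, 128, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 128, NUM_TAGS)

    def forward(self, word_ids, know_ids):
        # word_ids, know_ids: (batch, seq_len) integer tensors
        w = self.word_emb(word_ids)
        if self.use_embedding_layer:
            k = self.know_emb(know_ids)
        else:
            k = F.one_hot(know_ids, KNOWLEDGE_TYPES).float()
        h, _ = self.lstm(torch.cat([w, k], dim=-1))
        return self.out(h)  # per-token tag scores (a CRF layer could follow)

# Usage: tag a batch of two 5-token sentences with the embedding-layer variant.
model = KnowledgeBiLSTMTagger(use_embedding_layer=True)
words = torch.randint(0, VOCAB_SIZE, (2, 5))
knows = torch.randint(0, KNOWLEDGE_TYPES, (2, 5))
print(model(words, knows).shape)  # torch.Size([2, 5, 7])

In the embedding-layer variant the knowledge vectors are updated during training alongside the rest of the network, which is one plausible reason the abstract reports it outperforming the fixed one-hot encoding.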