School of Data and Computer Science, Guangdong Province Key Lab of Computational Science, Sun Yat-Sen University, Guangzhou, Guangdong 510006, PR China.
J Biomed Inform. 2019 Oct;98:103289. doi: 10.1016/j.jbi.2019.103289. Epub 2019 Sep 18.
Named entity recognition is a fundamental and crucial task in medical natural language processing. In the medical domain, Chinese clinical named entity recognition identifies the boundaries and types of medical entities in unstructured text such as electronic medical records. Recently, a character-level model combining bidirectional Long Short-Term Memory networks with a conditional random field (BiLSTM-CRF) has achieved great success on Chinese clinical named entity recognition tasks. However, this method captures only the contextual semantics between characters in a sentence. Chinese characters are ideographic: deeper semantic information is hidden in their internal structure, and the BiLSTM-CRF model fails to exploit it. In addition, some entities in a sentence depend on one another, yet the Long Short-Term Memory (LSTM) network does not capture long-range dependencies between characters well. We therefore propose a BiLSTM-CRF model augmented with radical-level features and a self-attention mechanism to address these problems. We use a convolutional neural network (CNN) to extract radical-level features, aiming to capture the intrinsic, internal relevance of characters, and we use self-attention to capture dependencies between characters regardless of their distance. Experiments show that our model achieves F1-scores of 93.00% and 86.34% on the CCKS-2017 and TP_CNER datasets, respectively.
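The radical-level feature extraction described above can be sketched as follows: a character's radical (component) embeddings are convolved with a bank of 1-D kernels and max-pooled over positions, yielding one fixed-size feature vector per character. This is a minimal numpy sketch; the kernel width, filter count, tanh activation, and zero-padding for short characters are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def radical_cnn_feature(radical_embs, filters):
    """Character feature from its radical embedding sequence via
    1-D convolution + max-over-time pooling.

    radical_embs: (n_radicals, d) embeddings of one character's components.
    filters: (n_filters, window, d) convolution kernels.
    Returns an (n_filters,) feature vector for the character.
    """
    n, d = radical_embs.shape
    n_f, w, _ = filters.shape
    # Zero-pad so characters with fewer radicals than the window
    # still yield at least one convolution position (an assumption).
    if n < w:
        radical_embs = np.vstack([radical_embs, np.zeros((w - n, d))])
        n = w
    conv = np.empty((n - w + 1, n_f))
    for i in range(n - w + 1):
        window = radical_embs[i:i + w]            # (w, d) slice
        conv[i] = np.tensordot(filters, window)   # dot each kernel with window
    return np.tanh(conv).max(axis=0)              # max-over-time pooling

rng = np.random.default_rng(1)
embs = rng.normal(size=(4, 6))      # a character with 4 radicals, dim 6
filt = rng.normal(size=(10, 3, 6))  # 10 kernels of window 3
feat = radical_cnn_feature(embs, filt)
print(feat.shape)  # (10,)
```

The pooled vector would then be concatenated with the character embedding before the BiLSTM layer, so the sequence model sees both contextual and sub-character information.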
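The claim that self-attention captures dependencies between characters regardless of their distance can be illustrated with a minimal sketch of scaled dot-product self-attention over the sequence of hidden states (e.g. BiLSTM outputs): every position attends to every other position in a single step, so the attention weight between two characters does not decay with their separation. The single-head formulation and random weights here are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """Scaled dot-product self-attention.

    H: (seq_len, d) hidden states, one per character.
    Wq, Wk, Wv: (d, d) projection matrices for queries, keys, values.
    Returns the attended states and the (seq_len, seq_len) weight matrix,
    in which entry (i, j) links characters i and j directly,
    however far apart they are in the sentence.
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d = 5, 8
H = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = self_attention(H, Wq, Wk, Wv)
print(out.shape, np.allclose(w.sum(axis=1), 1.0))  # (5, 8) True
```

In the proposed architecture this layer sits between the BiLSTM and the CRF, re-weighting each character's representation by its relevance to every other character before label decoding.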