Department of Computer Science, Harbin Institute of Technology, Shenzhen, China; Peng Cheng Laboratory, Shenzhen, China.
Department of Computer Science, Harbin Institute of Technology, Shenzhen, China.
J Biomed Inform. 2022 Apr;128:104035. doi: 10.1016/j.jbi.2022.104035. Epub 2022 Feb 23.
External knowledge, such as Chinese word lexicons and domain knowledge graphs (KGs) of concepts, has recently been adopted to improve machine learning methods for named entity recognition (NER), because it provides information beyond the sentence context. However, most existing studies draw on only one knowledge source (either a lexicon or a knowledge graph), each in its own way, and treat lexicon words or KG concepts separately from their boundary information. In this paper, we focus on leveraging multi-source knowledge in a unified manner, in which lexicon words or KG concepts are combined with their boundaries, for Chinese Clinical NER (CNER).
We propose a novel method based on the relational graph convolutional network (RGCN), called MKRGCN, to utilize multi-source knowledge in a unified manner for CNER. For each sentence, a relational graph is constructed for each knowledge source from its words or concepts: every lexicon word or KG concept that appears in the sentence is linked to the tokens it spans, together with its boundary information. RGCN models all relational graphs constructed from the multi-source knowledge, and the resulting knowledge-based token representations are integrated into the contextual token representations via an attention mechanism. On top of these knowledge-enhanced token representations, a conditional random field (CRF) layer predicts the named entity labels. In this study, a word lexicon and a medical knowledge graph are used as the knowledge sources for CNER.
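The graph construction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the B/M/E/S relation labels, the node naming scheme, the function names, the `max_len` cutoff, and the toy lexicon and KG entries are all assumptions made for the example. Each knowledge source yields its own list of typed edges from a matched word or concept to the character tokens it covers.

```python
from typing import Dict, List, Set, Tuple

# Assumed relation scheme: where a token sits inside a matched word or concept.
BEGIN, MIDDLE, END, SINGLE = "B", "M", "E", "S"

Edge = Tuple[str, str, str]  # (word/concept node, relation, token node)


def build_relational_graph(tokens: List[str], vocab: Set[str], max_len: int = 6) -> List[Edge]:
    """Link every vocab entry found in the sentence to the tokens it covers."""
    edges: List[Edge] = []
    n = len(tokens)
    for start in range(n):
        # Try every candidate span of up to max_len tokens starting here.
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = "".join(tokens[start:end])
            if word not in vocab:
                continue
            word_node = f"w:{word}@{start}"
            for i in range(start, end):
                if end - start == 1:
                    rel = SINGLE
                elif i == start:
                    rel = BEGIN
                elif i == end - 1:
                    rel = END
                else:
                    rel = MIDDLE
                edges.append((word_node, rel, f"t:{i}"))
    return edges


def build_multi_source_graphs(tokens: List[str], sources: Dict[str, Set[str]]) -> Dict[str, List[Edge]]:
    """Build one relational graph per knowledge source."""
    return {name: build_relational_graph(tokens, vocab) for name, vocab in sources.items()}


tokens = list("患者腹部疼痛")  # character-level tokens
sources = {
    "lexicon": {"患者", "腹部", "疼痛"},  # toy lexicon, invented for illustration
    "kg": {"腹部疼痛"},                   # toy KG concept, invented for illustration
}
graphs = build_multi_source_graphs(tokens, sources)
for name, edges in graphs.items():
    print(name, edges)
```

In a full model, each relation type would get its own weight matrix in the RGCN layers, so a token can tell whether it begins, continues, or ends a matched word or concept.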
Our proposed method achieves the best performance on the Chinese CCKS2017 and CCKS2018 datasets, with F1-scores of 91.88% and 89.91%, respectively, significantly outperforming existing methods. Extended experiments on the English NCBI-Disease and BC2GM datasets also demonstrate the effectiveness of our method when only one knowledge source is considered via RGCN.
The MKRGCN model can effectively integrate knowledge from an external lexicon and a knowledge graph for CNER and has the potential to be applied to English NER.
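For readers unfamiliar with the metric, the sketch below shows one common way to compute an entity-level F1-score from BIO tag sequences, counting a prediction as correct only when both span and type match exactly. The abstract does not state the matching criterion used, so strict matching is an assumption here, and the tag names and function names are invented for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def f1_score(gold: List[str], pred: List[str]) -> float:
    """Strict-match entity-level F1: harmonic mean of precision and recall."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_spans)
    recall = true_pos / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tags: the second entity is predicted with the wrong boundary.
gold = ["B-SYM", "I-SYM", "O", "B-BODY", "I-BODY"]
pred = ["B-SYM", "I-SYM", "O", "B-BODY", "O"]
print(f1_score(gold, pred))
```

Here one of two predicted entities matches one of two gold entities, so precision and recall are both 0.5, and so is the F1-score.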