School of Mathematical Sciences, Peking University, Beijing 100871, China; Center for Statistical Sciences, Peking University, Beijing 100871, China.
Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, China.
J Biomed Inform. 2020 Jul;107:103422. doi: 10.1016/j.jbi.2020.103422. Epub 2020 Apr 28.
Clinical Named Entity Recognition (CNER) is a critical task that aims to identify and classify clinical terms in electronic medical records. In recent years, deep neural networks have achieved significant success in CNER. However, these methods require high-quality, large-scale labeled clinical data, which are challenging and expensive to obtain, especially for Chinese clinical records. To tackle the Chinese CNER task, we pre-train a BERT model on unlabeled Chinese clinical records, which leverages unlabeled domain-specific knowledge. A Long Short-Term Memory (LSTM) layer and a Conditional Random Field (CRF) layer are used to extract text features and decode the predicted tags, respectively. In addition, we propose a new strategy to incorporate dictionary features into the model, and radical features of Chinese characters are used to further improve performance. To the best of our knowledge, our ensemble model outperforms state-of-the-art models, achieving an 89.56% strict F1 score on the CCKS-2018 dataset and a 91.60% F1 score on the CCKS-2017 dataset.
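The CRF decoding step mentioned above can be illustrated with a minimal sketch of linear-chain Viterbi decoding over BIO tags. This is not the paper's implementation; the tag set, emission scores, and transition scores below are invented for illustration (in the actual model, emissions would come from the BERT-LSTM encoder and transitions would be learned CRF parameters).

```python
# Sketch of linear-chain CRF Viterbi decoding over BIO tags, the decoding
# role the CRF layer plays in a BERT-LSTM-CRF tagger. All scores here are
# illustrative stand-ins, not trained values.

def viterbi_decode(emissions, transitions, tags):
    """emissions: list of {tag: score} dicts, one per token;
    transitions: {(prev_tag, cur_tag): score}; returns best tag sequence."""
    # best[t] = score of the best path ending in tag t at the current position
    best = {t: emissions[0].get(t, float("-inf")) for t in tags}
    backptr = []
    for emit in emissions[1:]:
        new_best, ptr = {}, {}
        for cur in tags:
            # pick the previous tag that maximizes path score into `cur`
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = (best[prev]
                             + transitions.get((prev, cur), 0.0)
                             + emit.get(cur, float("-inf")))
            ptr[cur] = prev
        best, backptr = new_best, backptr + [ptr]
    # backtrack from the highest-scoring final tag
    last = max(tags, key=lambda t: best[t])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Hypothetical 3-tag scheme for a symptom entity type.
tags = ["O", "B-SYM", "I-SYM"]
# The CRF can forbid invalid sequences such as O -> I-SYM.
transitions = {("O", "I-SYM"): -1e9, ("B-SYM", "I-SYM"): 1.0}
emissions = [
    {"O": 0.1, "B-SYM": 2.0, "I-SYM": 0.5},
    {"O": 0.2, "B-SYM": 0.1, "I-SYM": 1.5},
    {"O": 2.0, "B-SYM": 0.1, "I-SYM": 0.3},
]
print(viterbi_decode(emissions, transitions, tags))  # ['B-SYM', 'I-SYM', 'O']
```

The transition table is what distinguishes CRF decoding from independent per-token argmax: it lets the model score whole tag sequences, so structurally invalid outputs (e.g. an I- tag with no preceding B- tag) are ruled out jointly rather than token by token.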