Tang Buzhou, Jiang Dehuan, Chen Qingcai, Wang Xiaolong, Yan Jun, Shen Ying
Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, China.
Corresponding author:
AMIA Annu Symp Proc. 2020 Mar 4;2019:857-863. eCollection 2019.
De-identification of clinical text, the prerequisite of electronic clinical data reuse, is a typical named entity recognition (NER) problem. A number of state-of-the-art deep learning methods for NER, such as Bi-LSTM-CRF (bidirectional long-short-term-memory conditional random fields), have been applied for de-identification. Neural language models used for language representation bring great improvement in lots of NLP tasks when they are integrated with other deep learning methods. In this paper, we introduce Bi-LSTM-CRF with neural language models for de-identification of clinical text, and evaluate it on the de-identification datasets of the i2b2 2014 and the CEGS N-GRID 2016 challenges. Four neural language models of three types individually integrated with Bi-LSTM-CRF are compared in this study. Bi-LSTM-CRF with neural language models achieves the highest "strict" micro-averaged F1-score of 95.50% on the i2b2 2014 dataset and 91.82% on the CEGS N-GRID 2016 dataset, becoming new benchmark results on these two datasets respectively.

Keywords: De-identification, Named entity recognition, Bidirectional long-short-term-memory, Conditional random fields, Neural language models.
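As background for the architecture named in the abstract: in a Bi-LSTM-CRF tagger, the Bi-LSTM produces per-token emission scores and the CRF layer chooses the globally best tag sequence via Viterbi decoding, so that transitions such as O → I-NAME can be penalized. The sketch below is an illustrative, minimal Viterbi decoder, not the paper's implementation; the tag set, emission scores, and transition scores are invented for the example.

```python
# Minimal Viterbi decoding for a linear-chain CRF layer, as used on top
# of a Bi-LSTM encoder. Emissions would normally come from the Bi-LSTM;
# here they are hand-written illustrative values.

def viterbi_decode(emissions, transitions):
    """emissions: list of per-token dicts {tag: score};
    transitions: dict {(prev_tag, tag): score}.
    Returns the highest-scoring tag path."""
    tags = list(emissions[0].keys())
    # best path score ending in each tag at the first token
    score = {t: emissions[0][t] for t in tags}
    backptr = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            ptr[t] = best_prev
            new_score[t] = score[best_prev] + transitions[(best_prev, t)] + em[t]
        score = new_score
        backptr.append(ptr)
    # trace the best path backwards from the best final tag
    last = max(tags, key=lambda t: score[t])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

The transition table is what distinguishes the CRF from per-token argmax: a strong negative score on an invalid transition (e.g. O followed by an I- tag) rules out inconsistent BIO sequences even when the emissions favor them.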