Department of IT Center, the Children's Hospital, Zhejiang University School of Medicine, China; National Clinical Research Center for Child Health, China.
Department of Artificial Intelligence, Enterprise Institute, Ewell Technology, China.
J Biomed Inform. 2020 Aug;108:103481. doi: 10.1016/j.jbi.2020.103481. Epub 2020 Jul 18.
Named entity recognition (NER) is a principal task in the biomedical field, and deep learning-based algorithms have been widely applied to biomedical NER. However, the methods applied to biomedical corpora use only annotated samples to maximize their performance. As a result, (1) large numbers of unannotated samples are discarded and their value is overlooked. (2) Compared with other types of active learning (AL) algorithms, generative adversarial network (GAN)-based AL methods have developed slowly; furthermore, current diversity-based AL methods compute similarities only between pairs of sentences and cannot evaluate distribution similarities between groups of sentences. (3) Annotation inconsistency is one of the significant challenges in biomedical annotation, and most existing methods for addressing it are statistics-based or rule-based, requiring substantial expert knowledge and complex designs. To address challenges (1), (2), and (3) simultaneously, we propose GAN-based algorithms.
We introduce GAN and propose the GAN-bidirectional long short-term memory-conditional random field (GAN-BiLSTM-CRF) and the GAN-bidirectional encoder representations from transformers-conditional random field (GAN-BERT-CRF) models, each of which can serve as an NER model, an AL model, and a model for identifying erroneous labels. BiLSTM-CRF or BERT-CRF serves as the generator, and a convolutional neural network (CNN)-based network serves as the discriminator. (1) The generator employs unannotated samples in addition to annotated samples to maximize NER performance. (2) The outputs of the CRF layer and the discriminator are used to select unlabeled samples for the AL task. (3) The discriminator distinguishes the distribution of erroneous labels from that of correct labels, identifies erroneous labels, and thereby addresses the annotation inconsistency challenge.
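Point (2) can be illustrated with a minimal sketch of sample selection for AL. The abstract states only that the CRF layer output and the discriminator output are both used; the weighted-sum rule, the function name `rank_for_annotation`, and the parameter `alpha` below are assumptions made for illustration, not the scoring function from the paper.

```python
from typing import List, Sequence


def rank_for_annotation(
    crf_confidence: Sequence[float],
    disc_real_prob: Sequence[float],
    k: int,
    alpha: float = 0.5,
) -> List[int]:
    """Return the indices of the k unlabeled sentences to annotate next.

    crf_confidence[i]: probability the CRF assigns to its best label
        sequence for sentence i (assumed to be available from the generator).
    disc_real_prob[i]: discriminator's probability that the predicted
        labels for sentence i look like correct (gold) labels.
    alpha: hypothetical weight between the two signals (an assumption).
    """
    if len(crf_confidence) != len(disc_real_prob):
        raise ValueError("both score lists must cover the same sentences")

    # A sentence is more informative when the CRF is unsure about it and
    # when the discriminator judges its predicted labels as unlikely to be correct.
    scores = [
        alpha * (1.0 - c) + (1.0 - alpha) * (1.0 - d)
        for c, d in zip(crf_confidence, disc_real_prob)
    ]

    # Highest combined uncertainty first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    crf = [0.9, 0.4, 0.7, 0.2]
    disc = [0.8, 0.5, 0.3, 0.9]
    print(rank_for_annotation(crf, disc, k=2))
```

With `alpha=1.0` the rule reduces to plain least-confidence sampling on the CRF output, so the discriminator term can be read as an extra, distribution-level signal added on top of a standard uncertainty criterion.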
The corpus from the 2010 i2b2/VA NLP challenge and the Chinese CCKS-2017 Task 2 dataset are used for the experiments. Compared with the baseline BiLSTM-CRF and BERT-CRF models, the GAN-BiLSTM-CRF and GAN-BERT-CRF models achieve significant improvements in precision, recall, and F1 score for NER. Learning curves from the AL experiments show the comparative results of the proposed models. Furthermore, the trained discriminator can identify samples with incorrect medical labels in both simulated and real-world experimental environments.
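For readers unfamiliar with how precision, recall, and F1 are usually computed for NER, the following is a minimal sketch of entity-level scoring over BIO tag sequences. The abstract does not state the evaluation protocol, so exact-span matching, the BIO scheme, and the example tags are assumptions here.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[str, int, int]  # (entity type, start index, exclusive end index)


def extract_entities(tags: Iterable[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence."""
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def precision_recall_f1(gold: Iterable[str], pred: Iterable[str]):
    """Exact-match entity-level precision, recall, and F1."""
    g, p = extract_entities(gold), extract_entities(pred)
    tp = len(g & p)  # spans with identical type and boundaries
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative tags only; not taken from either corpus.
    gold = ["B-problem", "I-problem", "O", "B-test", "O", "B-treatment"]
    pred = ["B-problem", "I-problem", "O", "B-test", "I-test", "O"]
    print(precision_recall_f1(gold, pred))
```

In this example only the `problem` span matches exactly: the predicted `test` span has the wrong boundary and the `treatment` entity is missed, so one of two predicted spans and one of three gold spans count as correct.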
Introducing GAN yields significant results in NER, active learning, and the identification of incorrectly annotated samples. The benefits of GAN will be studied further.