Chai Zhaoying, Jin Han, Shi Shenghui, Zhan Siyan, Zhuo Lin, Yang Yu, Lian Qi
IEEE/ACM Trans Comput Biol Bioinform. 2023 Jan-Feb;20(1):595-605. doi: 10.1109/TCBB.2022.3157630. Epub 2023 Feb 3.
In recent years, Biomedical Named Entity Recognition (BioNER) systems, which extract information from the rapidly expanding biomedical literature, have mainly been based on deep neural networks. Transformer-based autoencoding language models that capture long-distance context have recently been applied to BioNER with great success. However, noise interferes with both pre-training and fine-tuning, and there is no effective decoder for modeling label dependency, so current models leave considerable room for improvement. We propose two noise reduction models, Shared Labels and Dynamic Splicing. Both use XLNet, a permutation language pre-training model, as the encoder and a Conditional Random Field (CRF) as the decoder. In tests on 15 BioNER datasets, the two models improved the average F1-score by 1.504 and 1.48, respectively, and achieved state-of-the-art performance on 7 of the datasets. Further analysis supports the effectiveness of the two models, shows that the CRF improves recognition, and suggests which model is applicable to data with different characteristics.
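To illustrate why a CRF decoder helps with label dependency, the following is a minimal sketch of Viterbi decoding over a linear-chain CRF. It is not the authors' implementation: the function name `viterbi_decode`, the three-label scheme, and all emission and transition scores are hypothetical values chosen for illustration. In the models described above, the emission scores would come from the XLNet encoder and the transition scores would be learned during training.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence under a linear-chain CRF."""
    if not emissions:
        return []
    # best[t][y]: score of the best path that ends with label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back = []  # back[t-1][y]: best previous label when position t takes label y
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, y), 0.0),
            )
            scores[y] = (
                best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for a three-token sentence (illustrative only)
labels = ["O", "B", "I"]
emissions = [
    {"O": 2.0, "B": 0.5, "I": 0.0},
    {"O": 0.0, "B": 0.9, "I": 1.0},
    {"O": 1.0, "B": 0.0, "I": 0.2},
]
# Assumed penalty: an inside tag should not directly follow an outside tag
transitions = {("O", "I"): -10.0}

# Independent per-token choice ignores label dependency
greedy = [max(labels, key=lambda y: e[y]) for e in emissions]
# CRF decoding scores the whole sequence, including transitions
decoded = viterbi_decode(emissions, transitions, labels)
```

With these scores, the per-token choice yields the invalid sequence O, I, O, while Viterbi decoding returns O, B, O because the transition penalty makes the O to I step too costly.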