Naderi Nona, Knafou Julien, Copara Jenny, Ruch Patrick, Teodoro Douglas
Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.
Swiss Institute of Bioinformatics, Geneva, Switzerland.
Front Res Metr Anal. 2021 Nov 19;6:689803. doi: 10.3389/frma.2021.689803. eCollection 2021.
The health and life science domains are well known for their wealth of named entities found in large free-text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods have been proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and their ensembles perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show a statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
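The ensembling step described in the abstract combines the per-token predictions of several fine-tuned models by majority vote. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each model emits one IOB tag per token over the same tokenization, and the tie-breaking convention (falling back to the first model's tag) is an illustrative assumption chosen for determinism.

```python
from collections import Counter
from typing import List

def majority_vote(predictions: List[List[str]]) -> List[str]:
    """Combine per-token IOB tags from several models by majority vote.

    `predictions` holds one tag sequence per model; all sequences must be
    aligned to the same tokens. Ties fall back to the first model's tag,
    one simple, deterministic convention among several possible.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    assert all(len(seq) == length for seq in predictions), "sequences must align"

    voted = []
    for i in range(length):
        tags = [seq[i] for seq in predictions]
        counts = Counter(tags)
        top, top_count = counts.most_common(1)[0]
        # Tie-break: prefer the first model's tag when counts are equal.
        tied = [t for t, c in counts.items() if c == top_count]
        voted.append(tags[0] if len(tied) > 1 and tags[0] in tied else top)
    return voted

if __name__ == "__main__":
    # Three hypothetical models tagging the same four tokens.
    model_a = ["B-CHEM", "I-CHEM", "O", "O"]
    model_b = ["B-CHEM", "O",      "O", "B-DISO"]
    model_c = ["B-CHEM", "I-CHEM", "O", "B-DISO"]
    print(majority_vote([model_a, model_b, model_c]))
    # -> ['B-CHEM', 'I-CHEM', 'O', 'B-DISO']
```

In practice, voting can also be done at the entity level rather than the token level, which avoids producing invalid IOB sequences; the token-level variant above is shown only because it is the simplest form of classical majority voting.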