Naderi Nona, Knafou Julien, Copara Jenny, Ruch Patrick, Teodoro Douglas
Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.
Swiss Institute of Bioinformatics, Geneva, Switzerland.
Front Res Metr Anal. 2021 Nov 19;6:689803. doi: 10.3389/frma.2021.689803. eCollection 2021.
The health and life science domains are well known for their wealth of named entities found in large free-text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods have been proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual deep masked language models and their ensembles perform across corpora of different health and life science domains (biology, chemistry, and medicine) available in different languages (English and French). Individual deep masked language models, pretrained on external corpora, are fine-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show a statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
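The ensembling step described in the abstract combines the per-token predictions of several fine-tuned models by majority vote. The following is a minimal sketch of that idea, not the authors' implementation: it assumes each model emits one IOB tag per token over the same tokenization, and the tie-breaking convention (falling back to the first model's tag) is an illustrative assumption chosen for determinism.

```python
from collections import Counter
from typing import List

def majority_vote(predictions: List[List[str]]) -> List[str]:
    """Combine per-token IOB tags from several models by majority vote.

    `predictions` holds one tag sequence per model; all sequences must be
    aligned to the same tokens. Ties fall back to the first model's tag,
    one simple, deterministic convention among several possible.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    assert all(len(seq) == length for seq in predictions), "sequences must align"

    voted = []
    for i in range(length):
        tags = [seq[i] for seq in predictions]
        counts = Counter(tags)
        top, top_count = counts.most_common(1)[0]
        # Tie-break: prefer the first model's tag when counts are equal.
        tied = [t for t, c in counts.items() if c == top_count]
        voted.append(tags[0] if len(tied) > 1 and tags[0] in tied else top)
    return voted

if __name__ == "__main__":
    # Three hypothetical models tagging the same four tokens.
    model_a = ["B-CHEM", "I-CHEM", "O", "O"]
    model_b = ["B-CHEM", "O",      "O", "B-DISO"]
    model_c = ["B-CHEM", "I-CHEM", "O", "B-DISO"]
    print(majority_vote([model_a, model_b, model_c]))
    # -> ['B-CHEM', 'I-CHEM', 'O', 'B-DISO']
```

In practice, voting can also be done at the entity level rather than the token level, which avoids producing invalid IOB sequences; the token-level variant above is shown only because it is the simplest form of classical majority voting.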