UMLS-based data augmentation for natural language processing of clinical research literature.

Affiliation

Department of Biomedical Informatics, Columbia University, New York, New York, USA.

Publication information

J Am Med Inform Assoc. 2021 Mar 18;28(4):812-823. doi: 10.1093/jamia/ocaa309.

Abstract

OBJECTIVE

The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity.

MATERIALS AND METHODS

We extended the easy data augmentation (EDA) method to biomedical named entity recognition (NER) by incorporating knowledge from the Unified Medical Language System (UMLS), and called the resulting method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT.
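The abstract does not spell out the augmentation operations, so the following is only a minimal sketch of one plausible ingredient: replacing an annotated entity mention with a synonym of the same concept while keeping the BIO labels aligned. The dictionary `TOY_SYNONYMS` is a hypothetical stand-in for a UMLS lookup, and the function names and the `Condition` label are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical stand-in for a UMLS synonym lookup: maps a lowercased mention
# to other surface forms of the same concept. Real UMLS access is not shown.
TOY_SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "high blood pressure": ["hypertension"],
}


def entity_spans(labels):
    """Return (start, end) token spans of BIO-labelled entities, end exclusive."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label.startswith("B-"):
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "O":
            if start is not None:
                spans.append((start, i))
            start = None
        # An "I-" label simply continues the current span.
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def replace_entity_synonyms(tokens, labels, synonyms, rng):
    """Swap each entity mention found in `synonyms` for a random synonym.

    Labels are regenerated for the replacement so that tokens and labels
    stay the same length, even when the synonym has a different token count.
    """
    new_tokens, new_labels, cursor = [], [], 0
    for start, end in entity_spans(labels):
        # Copy the untouched tokens before this entity.
        new_tokens.extend(tokens[cursor:start])
        new_labels.extend(labels[cursor:start])
        mention = " ".join(tokens[start:end]).lower()
        options = synonyms.get(mention)
        if options:
            replacement = rng.choice(options).split()
            entity_type = labels[start][2:]
            new_tokens.extend(replacement)
            new_labels.extend(
                ["B-" + entity_type] + ["I-" + entity_type] * (len(replacement) - 1)
            )
        else:
            # No known synonym: keep the original mention and labels.
            new_tokens.extend(tokens[start:end])
            new_labels.extend(labels[start:end])
        cursor = end
    new_tokens.extend(tokens[cursor:])
    new_labels.extend(labels[cursor:])
    return new_tokens, new_labels


tokens = ["Patients", "with", "heart", "attack", "were", "enrolled"]
labels = ["O", "O", "B-Condition", "I-Condition", "O", "O"]
aug_tokens, aug_labels = replace_entity_synonyms(
    tokens, labels, TOY_SYNONYMS, random.Random(0)
)
print(aug_tokens)
print(aug_labels)
```

Regenerating the labels from the entity type, rather than copying them, is what keeps the augmented sentence usable as NER training data when a synonym is longer or shorter than the original mention.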

RESULTS

UMLS-EDA substantially improves NER performance over the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain in micro-F1 score is 0.09 (9 percentage points), from 0.75 to 0.84, which is better than classifiers with BERT pretraining (0.82).
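For readers unfamiliar with the metric, micro-averaged F1 pools true positives, false positives, and false negatives across all classes before computing precision and recall. The sketch below shows that calculation; the function name and the per-class counts are made up for illustration and are not taken from the study.

```python
def micro_f1(counts):
    """Micro-averaged F1 from a mapping of class name to (tp, fp, fn)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct predictions at all: define the score as zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not results from the paper).
example_counts = {
    "Condition": (8, 2, 2),
    "Drug": (4, 1, 3),
}
score = micro_f1(example_counts)
print(round(score, 2))
```

Because the counts are pooled, frequent classes weigh more heavily in the score than rare ones.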

CONCLUSIONS

This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.
