UMLS-based data augmentation for natural language processing of clinical research literature.

Affiliation

Department of Biomedical Informatics, Columbia University, New York, New York, USA.

Publication information

J Am Med Inform Assoc. 2021 Mar 18;28(4):812-823. doi: 10.1093/jamia/ocaa309.

DOI: 10.1093/jamia/ocaa309
PMID: 33367705
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7973470/

Abstract

OBJECTIVE

The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity.

MATERIALS AND METHODS

We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating the Unified Medical Language System (UMLS) knowledge and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT.
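As an illustration only, the idea of knowledge-based augmentation for NER can be sketched as synonym replacement over annotated entity mentions. The abstract does not describe the exact operations of UMLS-EDA, so everything below is an assumption: `TOY_SYNONYMS` is a hypothetical stand-in for a UMLS synonym lookup, the BIO labeling scheme and the `Condition` entity type are chosen for the example, and only one EDA-style operation (synonym replacement) is shown. The point of the sketch is that labels must be regenerated when a replacement changes the number of tokens.

```python
import random

# Hypothetical stand-in for a UMLS synonym lookup (not real UMLS data access).
TOY_SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "high blood pressure": ["hypertension"],
}

def entity_spans(labels):
    """Return (start, end, type) spans from BIO labels; end is exclusive."""
    spans, start = [], None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a trailing span
        if start is not None and not lab.startswith("I-"):
            spans.append((start, i, labels[start][2:]))
            start = None
        if lab.startswith("B-"):
            start = i
    return spans

def replace_entities(tokens, labels, synonyms, rng):
    """Swap each entity mention for a synonym (if one is known) and rebuild BIO labels."""
    out_tokens, out_labels, cursor = [], [], 0
    for start, end, etype in entity_spans(labels):
        # Copy the untouched tokens before this entity.
        out_tokens += tokens[cursor:start]
        out_labels += labels[cursor:start]
        mention = " ".join(tokens[start:end]).lower()
        options = synonyms.get(mention)
        new = rng.choice(options).split() if options else tokens[start:end]
        out_tokens += new
        out_labels += ["B-" + etype] + ["I-" + etype] * (len(new) - 1)
        cursor = end
    out_tokens += tokens[cursor:]
    out_labels += labels[cursor:]
    return out_tokens, out_labels

if __name__ == "__main__":
    tokens = "Patients with heart attack were enrolled".split()
    labels = ["O", "O", "B-Condition", "I-Condition", "O", "O"]
    print(replace_entities(tokens, labels, TOY_SYNONYMS, random.Random(0)))
```

Each augmented sentence would be added to the training set alongside the original; how many variants to generate per sentence is a tuning choice this sketch leaves open.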

RESULTS

UMLS-EDA enables substantial improvement for NER tasks from the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain on micro-F1 score is 9%, from 0.75 to 0.84, better than classifiers with BERT pretraining (0.82).
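For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives across all classes before computing precision and recall, so frequent classes weigh more than rare ones. A minimal sketch follows; the function name `micro_f1` and the per-class counts are hypothetical and are not taken from the study.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-class (tp, fp, fn) counts.

    Pooling the counts first gives F1 = 2*TP / (2*TP + FP + FN).
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

if __name__ == "__main__":
    # Hypothetical counts for two entity types: (tp, fp, fn).
    example = {"Participant": (8, 2, 2), "Intervention": (2, 3, 3)}
    print(round(micro_f1(example), 4))
```

The "from 0.75 to 0.84" figure in the results indicates that the reported gains are absolute differences in this score.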

CONCLUSIONS

This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.
