UMLS-based data augmentation for natural language processing of clinical research literature.

Author information

Department of Biomedical Informatics, Columbia University, New York, New York, USA.

Publication information

J Am Med Inform Assoc. 2021 Mar 18;28(4):812-823. doi: 10.1093/jamia/ocaa309.

DOI:10.1093/jamia/ocaa309

PMID:33367705

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7973470/

Abstract

OBJECTIVE

The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity.

MATERIALS AND METHODS

We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating the Unified Medical Language System (UMLS) knowledge and called this method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT.
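To make the idea of knowledge-based augmentation for NER concrete, here is a minimal sketch of one plausible operation: replacing a labeled entity mention with a synonym and re-deriving the BIO tags. The abstract does not specify which operations UMLS-EDA performs, so this is an illustration under stated assumptions, not the authors' method. The `TOY_SYNONYMS` table is a hypothetical stand-in for a real UMLS lookup, and the function names and the `Condition` label are invented for this example.

```python
import random

# Hypothetical toy synonym table standing in for a UMLS lookup (assumption).
TOY_SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "high blood pressure": ["hypertension"],
}


def entity_spans(labels):
    """Yield (start, end, type) for each BIO-labeled entity; end is exclusive."""
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, lab in enumerate(list(labels) + ["O"]):
        if start is not None and not lab.startswith("I-"):
            yield start, i, etype
            start = None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]


def replace_entities(tokens, labels, synonyms, rng, p=1.0):
    """Swap entity mentions for synonyms with probability p, keeping BIO tags aligned."""
    new_tokens, new_labels = [], []
    pos = 0
    for start, end, etype in entity_spans(labels):
        # Copy the non-entity tokens before this entity unchanged.
        new_tokens.extend(tokens[pos:start])
        new_labels.extend(labels[pos:start])
        mention = " ".join(tokens[start:end])
        options = synonyms.get(mention.lower(), [])
        if options and rng.random() < p:
            replacement = rng.choice(options).split()
        else:
            replacement = list(tokens[start:end])
        # Re-tag the replacement, which may differ in length from the original.
        new_tokens.extend(replacement)
        new_labels.extend(["B-" + etype] + ["I-" + etype] * (len(replacement) - 1))
        pos = end
    new_tokens.extend(tokens[pos:])
    new_labels.extend(labels[pos:])
    return new_tokens, new_labels


tokens = ["Patients", "with", "high", "blood", "pressure", "were", "excluded"]
labels = ["O", "O", "B-Condition", "I-Condition", "I-Condition", "O", "O"]
aug_tokens, aug_labels = replace_entities(
    tokens, labels, TOY_SYNONYMS, random.Random(0)
)
print(aug_tokens)
print(aug_labels)
```

The key design point is that token-level labels are regenerated after each replacement, since a synonym can have a different number of tokens than the mention it replaces.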

RESULTS

UMLS-EDA enables substantial improvement for NER tasks from the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain on micro-F1 score is 9%, from 0.75 to 0.84, better than classifiers with BERT pretraining (0.82).
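For readers less familiar with the metric, the following sketch shows how a micro-averaged F1 score is computed from pooled per-class counts, and how the reported gain from 0.75 to 0.84 reads as an absolute difference of 0.09 (9 percentage points). The per-class counts below are made-up illustration values, not data from the study, and the function name is chosen for this example.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-class (tp, fp, fn) counts.

    Counts are pooled across classes before computing precision and recall,
    which is equivalent to 2*TP / (2*TP + FP + FN).
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts for two entity classes (illustration only).
example_counts = {"Condition": (8, 2, 2), "Drug": (1, 1, 3)}
score = micro_f1(example_counts)
print(round(score, 4))

# Reading the reported classification gain: 0.75 -> 0.84.
absolute_gain = round(0.84 - 0.75, 2)           # difference in F1 points
relative_gain = round((0.84 - 0.75) / 0.75, 2)  # gain relative to the baseline
print(absolute_gain, relative_gain)
```

Because micro-averaging pools counts, frequent classes dominate the score, which is worth keeping in mind when comparing results across datasets with different label distributions.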

CONCLUSIONS

This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.
