Department of Biomedical Informatics, Columbia University, New York, New York, USA.
J Am Med Inform Assoc. 2021 Mar 18;28(4):812-823. doi: 10.1093/jamia/ocaa309.
The study sought to develop and evaluate a knowledge-based data augmentation method to improve the performance of deep learning models for biomedical natural language processing by overcoming training data scarcity.
We extended the easy data augmentation (EDA) method for biomedical named entity recognition (NER) by incorporating knowledge from the Unified Medical Language System (UMLS), and called the resulting method UMLS-EDA. We designed experiments to systematically evaluate the effect of UMLS-EDA on popular deep learning architectures for both NER and classification. We also compared UMLS-EDA to BERT.
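The abstract does not spell out the individual augmentation operations, so the following is only a minimal sketch of one plausible operation: replacing a labeled entity mention with a synonym of the same concept while keeping the BIO labels aligned. `TOY_SYNONYMS` and `augment_ner` are hypothetical names introduced for illustration; the dictionary is a hand-written stand-in for a UMLS lookup and does not reflect the real UMLS interface or the authors' implementation.

```python
import random

# Hypothetical stand-in for a UMLS synonym lookup (hand-written, not real UMLS data).
TOY_SYNONYMS = {
    "heart attack": ["myocardial infarction"],
    "high blood pressure": ["hypertension"],
}


def augment_ner(tokens, labels, synonyms, rng):
    """Replace each labeled entity span with a synonym and re-emit BIO labels."""
    out_tokens, out_labels = [], []
    i = 0
    while i < len(tokens):
        if labels[i].startswith("B-"):
            etype = labels[i][2:]
            # Find the end of the entity span (B- followed by matching I- labels).
            j = i + 1
            while j < len(tokens) and labels[j] == "I-" + etype:
                j += 1
            mention = " ".join(tokens[i:j]).lower()
            options = synonyms.get(mention)
            # Keep the original mention when no synonym is known.
            new_span = rng.choice(options).split() if options else tokens[i:j]
            out_tokens.extend(new_span)
            out_labels.extend(["B-" + etype] + ["I-" + etype] * (len(new_span) - 1))
            i = j
        else:
            out_tokens.append(tokens[i])
            out_labels.append(labels[i])
            i += 1
    return out_tokens, out_labels


tokens = ["Patient", "had", "a", "heart", "attack"]
labels = ["O", "O", "O", "B-Problem", "I-Problem"]
new_tokens, new_labels = augment_ner(tokens, labels, TOY_SYNONYMS, random.Random(0))
```

Regenerating the labels from the replacement span, rather than copying the old ones, keeps tokens and labels the same length even when the synonym has a different number of tokens than the original mention.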
UMLS-EDA enables substantial improvement for NER tasks from the original long short-term memory conditional random fields (LSTM-CRF) model (micro-F1 score: +5%, +17%, and +15%), helps the LSTM-CRF model (micro-F1 score: 0.66) outperform LSTM-CRF with transfer learning by BERT (0.63), and improves the performance of the state-of-the-art sentence classification model. The largest gain in micro-F1 score is 9 percentage points, from 0.75 to 0.84, which exceeds classifiers with BERT pretraining (0.82).
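For readers unfamiliar with the metric, micro-F1 pools true positives, false positives, and false negatives across all classes before computing precision and recall. The sketch below shows that arithmetic; the function name `micro_f1` and the per-class counts are made up for illustration and are not results from the study.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-class dicts with 'tp', 'fp', and 'fn' counts."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    if tp == 0:
        return 0.0  # no correct predictions: define F1 as 0 to avoid division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (not data from the paper).
example_counts = {
    "Problem": {"tp": 8, "fp": 2, "fn": 2},
    "Treatment": {"tp": 1, "fp": 3, "fn": 3},
}
score = micro_f1(example_counts)
```

Because the counts are pooled, frequent classes dominate the score, which is why micro-F1 can differ noticeably from a per-class (macro) average on imbalanced entity types.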
This study presents a UMLS-based data augmentation method, UMLS-EDA. It is effective at improving deep learning models for both NER and sentence classification, and contributes original insights for designing new, superior deep learning approaches for low-resource biomedical domains.