Text Data Augmentation for Deep Learning.

Authors

Shorten Connor, Khoshgoftaar Taghi M, Furht Borko

Affiliation

Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431 USA.

Publication information

J Big Data. 2021;8(1):101. doi: 10.1186/s40537-021-00492-0. Epub 2021 Jul 19.


DOI:10.1186/s40537-021-00492-0
PMID:34306963
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8287113/
Abstract

Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.
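
The distinction the abstract draws between offline and online augmentation pipelines can be illustrated with a minimal sketch. It is not taken from the paper: the token-level operations (`random_swap`, `random_delete`), the helper names, and the parameters `copies`, `p`, and `seed` are all assumptions chosen for illustration, and real pipelines typically use richer, label-preserving transformations.

```python
import random
from typing import Iterator, List, Sequence, Tuple

Example = Tuple[str, int]  # (text, label); a hypothetical dataset format


def random_swap(tokens: Sequence[str], rng: random.Random) -> List[str]:
    """Swap two randomly chosen token positions."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens: Sequence[str], rng: random.Random, p: float = 0.1) -> List[str]:
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(list(tokens))]


def augment(text: str, rng: random.Random) -> str:
    """Apply one randomly chosen perturbation to a whitespace-tokenized text."""
    tokens = text.split()
    if not tokens:
        return text
    op = rng.choice([random_swap, random_delete])
    return " ".join(op(tokens, rng))


def offline_augment(dataset: Sequence[Example], copies: int, seed: int = 0) -> List[Example]:
    """Offline: build the enlarged dataset once, before training starts."""
    rng = random.Random(seed)
    out = list(dataset)
    for text, label in dataset:
        for _ in range(copies):
            out.append((augment(text, rng), label))
    return out


def online_batches(
    dataset: Sequence[Example], batch_size: int, epochs: int, seed: int = 0
) -> Iterator[List[Example]]:
    """Online: perturb examples as each batch is drawn, so every epoch differs."""
    rng = random.Random(seed)
    for _ in range(epochs):
        order = list(dataset)
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            yield [(augment(text, rng), label) for text, label in batch]
```

In this sketch the offline variant trades storage for a fixed, inspectable training set, while the online variant stores nothing extra and shows the model a fresh perturbation each time an example is drawn.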

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3142/8287113/d81a3d50354f/40537_2021_492_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3142/8287113/81f2fd2b8a98/40537_2021_492_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3142/8287113/73a13ea3ec64/40537_2021_492_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3142/8287113/be298d3f3797/40537_2021_492_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3142/8287113/200fb1aa8403/40537_2021_492_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3142/8287113/fee535a0f650/40537_2021_492_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3142/8287113/ae645347a150/40537_2021_492_Fig7_HTML.jpg

Similar articles

[1]
Text Data Augmentation for Deep Learning.

J Big Data. 2021

[2]
Clinical Text Data in Machine Learning: Systematic Review.

JMIR Med Inform. 2020-3-31

[3]
Learning policy scheduling for text augmentation.

Neural Netw. 2022-1

[4]
The Effectiveness of Image Augmentation in Deep Learning Networks for Detecting COVID-19: A Geometric Transformation Perspective.

Front Med (Lausanne). 2021-3-1

[5]
Applications of natural language processing in ophthalmology: present and future.

Front Med (Lausanne). 2022-8-8

[6]
Improving the robustness and accuracy of biomedical language models through adversarial training.

J Biomed Inform. 2022-8

[7]
The language of proteins: NLP, machine learning & protein sequences.

Comput Struct Biotechnol J. 2021-3-25

[8]
Natural Language Processing in Nephrology.

Adv Chronic Kidney Dis. 2022-9

[9]
Text data augmentation and pre-trained Language Model for enhancing text classification of low-resource languages.

PeerJ Comput Sci. 2024-3-29

[10]
Annotated dataset creation through large language models for non-english medical NLP.

J Biomed Inform. 2023-9

Cited by

[1]
A Transformer-Based Framework With Data Augmentation for Robust Seizure Detection Across Invasive and Noninvasive Neural Recordings.

CNS Neurosci Ther. 2025-9

[2]
Leveraging synthetic data produced from museum specimens to train adaptable species classification models.

PLoS One. 2025-9-3

[3]
Deep learning-based semantic segmentation for rice yield estimation by analyzing the dynamic change of panicle coverage.

Front Plant Sci. 2025-8-14

[4]
Combining curriculum learning and weakly supervised attention for enhanced thyroid nodule assessment in ultrasound imaging.

Quant Imaging Med Surg. 2025-9-1

[5]
Detecting papilloedema as a marker of raised intracranial pressure using artificial intelligence: A systematic review.

PLOS Digit Health. 2025-9-2

[6]
Optimizing anisotropic margins in single-isocenter multiple brain metastases radiosurgery using regressor strategies: A multi-institutional validation study.

J Appl Clin Med Phys. 2025-9

[7]
Evaluation of deep learning models using explainable AI with qualitative and quantitative analysis for rice leaf disease detection.

Sci Rep. 2025-8-29

[8]
Automated road surface classification in OpenStreetMap using MaskCNN and aerial imagery.

Front Big Data. 2025-8-13

[9]
Foliar disease resistance phenomics of fungal pathogens: image-based approaches for mapping quantitative resistance in cereal germplasm.

Theor Appl Genet. 2025-8-28

[10]
Detection and Recognition of Bilingual Urdu and English Text in Natural Scene Images Using a Convolutional Neural Network-Recurrent Neural Network Combination with a Connectionist Temporal Classification Decoder.

Sensors (Basel). 2025-8-19

References

[1]
Trajectory Inspection: A Method for Iterative Clinician-Driven Design of Reinforcement Learning Studies.

AMIA Jt Summits Transl Sci Proc. 2021

[2]
A Survey on Knowledge Graphs: Representation, Acquisition, and Applications.

IEEE Trans Neural Netw Learn Syst. 2022-2

[3]
Deep Learning applications for COVID-19.

J Big Data. 2021

[4]
Repurpose Open Data to Discover Therapeutics for COVID-19 Using Deep Learning.

J Proteome Res. 2020-7-24

[5]
Clinical Text Data in Machine Learning: Systematic Review.

JMIR Med Inform. 2020-3-31

[6]
A Style-Based Generator Architecture for Generative Adversarial Networks.

IEEE Trans Pattern Anal Mach Intell. 2021-12

[7]
MIMIC-III, a freely accessible critical care database.

Sci Data. 2016-5-24
