Text Data Augmentation for Deep Learning.

Authors

Shorten Connor, Khoshgoftaar Taghi M, Furht Borko

Affiliation

Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431 USA.

Publication information

J Big Data. 2021;8(1):101. doi: 10.1186/s40537-021-00492-0. Epub 2021 Jul 19.

DOI:10.1186/s40537-021-00492-0

PMID:34306963

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8287113/

Abstract

Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation summarized into strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to preview a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.
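The distinction the abstract draws between offline and online augmentation pipelines can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: random token deletion stands in for whatever augmentation is actually used, and the names `random_token_deletion`, `augment_offline`, and `augment_online` are hypothetical.

```python
import random
from typing import Iterator, List, Tuple

Example = Tuple[str, int]  # (text, label); a hypothetical toy format


def random_token_deletion(text: str, p: float, rng: random.Random) -> str:
    """Drop each whitespace-separated token with probability p, keeping at least one."""
    tokens = text.split()
    if not tokens:
        return text
    kept = [tok for tok in tokens if rng.random() >= p]
    if not kept:
        # Never return an empty string: fall back to one random original token.
        kept = [rng.choice(tokens)]
    return " ".join(kept)


def augment_offline(data: List[Example], copies: int, p: float, seed: int = 0) -> List[Example]:
    """Offline: materialize the augmented copies once, before training starts."""
    rng = random.Random(seed)
    out = list(data)  # keep the original examples
    for text, label in data:
        for _ in range(copies):
            out.append((random_token_deletion(text, p, rng), label))
    return out


def augment_online(data: List[Example], epochs: int, p: float, seed: int = 0) -> Iterator[Example]:
    """Online: draw a fresh augmentation each time an example is visited."""
    rng = random.Random(seed)
    for _ in range(epochs):
        for text, label in data:
            yield random_token_deletion(text, p, rng), label


if __name__ == "__main__":
    toy = [("the movie was surprisingly good", 1), ("a dull and lifeless plot", 0)]
    print(len(augment_offline(toy, copies=2, p=0.2)))
    for text, label in augment_online(toy, epochs=2, p=0.2):
        print(label, text)
```

In this sketch the offline variant pays a storage cost for a fixed enlarged dataset, while the online variant stores nothing extra and produces a different perturbation on each pass. Both assume the label is unchanged by the perturbation, which may not hold for every task.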


Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3142/8287113/d81a3d50354f/40537_2021_492_Fig1_HTML.jpg
