Shorten Connor, Khoshgoftaar Taghi M, Furht Borko
Florida Atlantic University, 777 Glades Road, Boca Raton, FL 33431 USA.
J Big Data. 2021;8(1):101. doi: 10.1186/s40537-021-00492-0. Epub 2021 Jul 19.
Natural Language Processing (NLP) is one of the most captivating applications of Deep Learning. In this survey, we consider how the Data Augmentation training strategy can aid in its development. We begin with the major motifs of Data Augmentation, which we summarize as strengthening local decision boundaries, brute force training, causality and counterfactual examples, and the distinction between meaning and form. We follow these motifs with a concrete list of augmentation frameworks that have been developed for text data. Deep Learning generally struggles with the measurement of generalization and characterization of overfitting. We highlight studies that cover how augmentations can construct test sets for generalization. NLP is at an early stage in applying Data Augmentation compared to Computer Vision. We highlight the key differences and promising ideas that have yet to be tested in NLP. For the sake of practical implementation, we describe tools that facilitate Data Augmentation such as the use of consistency regularization, controllers, and offline and online augmentation pipelines, to name a few. Finally, we discuss interesting topics around Data Augmentation in NLP such as task-specific augmentations, the use of prior knowledge in self-supervised learning versus Data Augmentation, intersections with transfer and multi-task learning, and ideas for AI-GAs (AI-Generating Algorithms). We hope this paper inspires further research interest in Text Data Augmentation.