Abdhood Samia F, Omar Nazlia, Tiun Sabrina
Center for Artificial Intelligence Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia.
Faculty of Computers and Information Technology, Hadhramout University, Almukalla, Hadhramout, Yemen.
PeerJ Comput Sci. 2025 Mar 10;11:e2685. doi: 10.7717/peerj-cs.2685. eCollection 2025.
The effectiveness of data augmentation techniques, i.e., methods for artificially creating new data, has been demonstrated in many domains, from images to textual data. Data augmentation methods were developed to address the scarcity of training data and class imbalance, and thereby to improve classifier performance. This review article investigates data augmentation techniques for Arabic texts, specifically in the field of text classification. A thorough review was conducted to give a concise and comprehensive understanding of these approaches in the context of Arabic classification. The article focuses on studies published from 2019 to 2024 on data augmentation in Arabic text classification. Inclusion and exclusion criteria were applied to ensure a comprehensive view of these techniques in Arabic natural language processing (ANLP). It was found that data augmentation research for Arabic text classification is concentrated in sentiment analysis and propaganda detection, with initial studies emerging in 2019; very few studies have investigated other domains such as sarcasm detection or text categorization. We also observed a lack of benchmark datasets for these tasks. Most studies have focused on short texts, such as Twitter data or reviews, while long texts remain largely unexplored. In particular, data augmentation methods still need to be examined on long texts to determine whether techniques that are effective for short texts also apply to longer ones. Because of the unique characteristics of the Arabic language, a rigorous investigation and comparison of the most effective strategies is required. Such a comparison would improve our understanding of the processes involved in Arabic text classification and make it possible to select the most suitable data augmentation methods for specific tasks. This review contributes valuable insights into Arabic NLP and enriches the existing body of knowledge.