Data augmentation for Arabic text classification: a review of current methods, challenges and prospective directions.

Authors

Abdhood Samia F, Omar Nazlia, Tiun Sabrina

Affiliations

Center for Artificial Intelligence Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia.

Faculty of Computers and Information Technology, Hadhramout University, Almukalla, Hadhramout, Yemen.

Publication information

PeerJ Comput Sci. 2025 Mar 10;11:e2685. doi: 10.7717/peerj-cs.2685. eCollection 2025.

DOI: 10.7717/peerj-cs.2685
PMID: 40134861
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11935767/

Abstract

The effectiveness of data augmentation techniques, i.e., methods for artificially creating new data, has been demonstrated in many domains, from images to textual data. Data augmentation methods were established to manage different issues regarding the scarcity of training datasets or the class imbalance to enhance the performance of classifiers. This review article investigates data augmentation techniques for Arabic texts, specifically in the text classification field. A thorough review was conducted to give a concise and comprehensive understanding of these approaches in the context of Arabic classification. The focus of this article is on Arabic studies published from 2019 to 2024 about data augmentation in Arabic text classification. Inclusion and exclusion criteria were applied to ensure a comprehensive vision of these techniques in Arabic natural language processing (ANLP). It was found that data augmentation research for Arabic text classification is dominated by sentiment analysis and propaganda detection, with initial studies emerging in 2019; very few studies have investigated other domains like sarcasm detection or text categorization. We also observed the lack of benchmark datasets for performing the tasks. Most studies have focused on short texts, such as Twitter data or reviews, while research on long texts still needs to be explored. Additionally, various data augmentation methods still need to be examined for long texts to determine if techniques effective for short texts are also applicable to longer texts. A rigorous investigation and comparison of the most effective strategies is required due to the unique characteristics of the Arabic language. By doing so, we can better understand the processes involved in Arabic text classification and hence be able to select the most suitable data augmentation methods for specific tasks. This review contributes valuable insights into Arabic NLP and enriches the existing body of knowledge.
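
To make the idea of "artificially creating new data" for an under-represented class concrete, here is a minimal sketch in Python. It is an illustration only: the abstract does not name specific augmentation methods, so the two operations used here (random token swap and random token deletion) are generic techniques chosen as an assumption, and the function names `random_swap`, `random_delete`, and `augment_minority` are hypothetical. The sketch splits on whitespace, which ignores Arabic morphology such as attached clitics, so it should not be read as a method recommended by the review.

```python
import random


def random_swap(tokens, rng, n_swaps=1):
    """Return a copy of tokens with n_swaps random pairs of positions swapped."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, rng, p=0.1):
    """Drop each token with probability p, always keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment_minority(samples, rng, target_count):
    """Oversample (text, label) pairs with perturbed copies until target_count.

    The original samples are kept unchanged at the front of the result;
    each synthetic sample inherits the label of the sample it was derived from.
    """
    out = list(samples)
    if not samples:
        return out
    while len(out) < target_count:
        text, label = rng.choice(samples)
        tokens = text.split()  # assumption: naive whitespace tokenization
        op = rng.choice([random_swap, random_delete])
        out.append((" ".join(op(tokens, rng)), label))
    return out


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    # Hypothetical minority-class samples (short review-style sentences).
    minority = [
        ("الخدمة ممتازة والطعام لذيذ جدا", "positive"),
        ("تجربة رائعة وسوف أعود مرة أخرى", "positive"),
    ]
    for text, label in augment_minority(minority, rng, target_count=6):
        print(label, "|", text)
```

Operations like these can change or destroy the meaning of a sentence, so in practice the synthetic samples would need checking against the original labels, which is one reason the review calls for a rigorous comparison of strategies on Arabic text.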

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1146/11935767/97f288e66621/peerj-cs-11-2685-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1146/11935767/b6bd3a6d09e1/peerj-cs-11-2685-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1146/11935767/2382a47a4b77/peerj-cs-11-2685-g003.jpg
