Hamdy Abdelrahman, Youssef Ayman, Ryan Conor
The Open University, Milton Keynes, United Kingdom.
Department of Computers and Systems, Electronics Research Institute, Cairo, Egypt.
PLoS One. 2025 Aug 29;20(8):e0328369. doi: 10.1371/journal.pone.0328369. eCollection 2025.
The analysis of Arabic Twitter data sets is a highly active research topic, particularly since the outbreak of COVID-19 and the subsequent attempts to understand public sentiment about the pandemic. This activity is driven in part by the large number of Arabic Twitter users, around 164 million. Word embedding models are a vital tool for analysing Twitter data sets because they transform words into numeric vectors that machine learning (ML) algorithms can process. In this work, we introduce a new model, Arab2Vec, for use in Twitter-based natural language processing (NLP) applications. Arab2Vec was constructed from a data set of approximately 186,000,000 tweets, posted between 2008 and 2021 and drawn from all Arabic Twitter sources. This makes Arab2Vec the most up-to-date word embedding model available to researchers for Twitter-based applications. The model is compared with existing models from the literature. The reported results show that Arab2Vec recognises more words, achieves higher F1 scores on classification tasks with known data sets, and is able to work with emojis. We also incorporate skip-grams with negative sampling, an approach that other Arabic models have not previously used. Nine versions of Arab2Vec are produced; they differ in available features, the number of words trained on, speed, and other properties. This paper provides Arab2Vec as an open-source project for use in research. It describes the data collection methods, the data pre-processing and cleaning steps, the effort to build the nine models, and the experiments that validate them qualitatively and quantitatively.
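To make the skip-gram with negative sampling idea concrete, the following is a minimal Python sketch of how training examples could be generated from a tokenised tweet. It is not the authors' pipeline: the function names `skipgram_pairs` and `negative_samples`, the window size, the number of negatives, and the toy tweet are all illustrative assumptions, and the 0.75 smoothing exponent follows the original word2vec formulation rather than anything stated for Arab2Vec.

```python
import random
from collections import Counter


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def negative_samples(counts, exclude, k, rng):
    """Draw k negative words with probability proportional to count ** 0.75.

    Words in `exclude` (typically the center and its true context word)
    are never returned. The 0.75 exponent is an assumption taken from the
    original word2vec formulation.
    """
    words = [w for w in counts if w not in exclude]
    weights = [counts[w] ** 0.75 for w in words]
    return rng.choices(words, weights=weights, k=k)


if __name__ == "__main__":
    # Hypothetical toy tweet; the emoji is kept as an ordinary token.
    tweet = ["صباح", "الخير", "يا", "عالم", "😊"]
    counts = Counter(tweet)
    rng = random.Random(0)  # fixed seed so the sketch is repeatable

    for center, context in skipgram_pairs(tweet, window=1):
        negatives = negative_samples(counts, {center, context}, k=2, rng=rng)
        # Each positive pair is trained against its sampled negatives.
        print(center, context, negatives)
```

Each positive (center, context) pair is paired with a few randomly drawn words that did not co-occur with the center word, so a model only has to distinguish real contexts from noise instead of scoring the whole vocabulary.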