Alsuwaylimi Amjad A
Department of Information Technology, Faculty of Computing and Information Technology, Northern Border University, Rafha, 91911, Saudi Arabia.
Heliyon. 2024 Aug 13;10(17):e36280. doi: 10.1016/j.heliyon.2024.e36280. eCollection 2024 Sep 15.
Arabic Dialect Identification (ADI) is a challenging task in natural language processing because of the diversity and regional variation of Arabic dialects. Despite previous efforts, the task remains difficult. This study therefore uses transformers to address ADI on social media. Two hybrid models are proposed: one combines Bidirectional Long Short-Term Memory (BiLSTM) with CAMeLBERT, and the other combines BiLSTM with ALBERT. In addition, a novel dataset is introduced, comprising 121,289 user-generated comments from various social media platforms and covering four major Arabic dialects (Egyptian, Jordanian, Gulf and Yemeni). Several experiments were conducted using conventional Machine Learning Classifiers (MLCs) and Deep Learning Models (DLMs) as baselines to measure the performance and effectiveness of the proposed models. Binary classification is also performed between pairs of dialects to determine which dialects are closest to each other. Model performance is measured using common metrics such as precision, recall and F-score (F-measure). Experimental results demonstrate the superior performance of the proposed hybrid models in ADI: CAMeLBERT with BiLSTM and ALBERT with BiLSTM recorded accuracies of 87.67% and 86.51%, respectively.
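The evaluation metrics named in the abstract (precision, recall, F-score and accuracy) can be illustrated with a minimal sketch for the four dialect labels. The function names `per_class_metrics` and `accuracy` and the toy label lists are hypothetical and chosen only for illustration; they are not the study's code or data, and the abstract does not say how the metrics were averaged across dialects.

```python
from collections import Counter

# The four dialect labels named in the abstract.
DIALECTS = ["Egyptian", "Jordanian", "Gulf", "Yemeni"]


def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f_score)} for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative

    results = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f_score)
    return results


def accuracy(y_true, y_pred):
    """Fraction of comments whose predicted dialect matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only (not taken from the paper's dataset).
toy_true = ["Egyptian", "Egyptian", "Jordanian", "Gulf", "Gulf", "Yemeni"]
toy_pred = ["Egyptian", "Gulf", "Jordanian", "Gulf", "Yemeni", "Yemeni"]

for dialect, (p, r, f) in per_class_metrics(toy_true, toy_pred, DIALECTS).items():
    print(f"{dialect}: precision={p:.2f} recall={r:.2f} f_score={f:.2f}")
print(f"accuracy={accuracy(toy_true, toy_pred):.2f}")
```

The same functions apply unchanged to the pairwise binary setting by passing only the two dialects being compared as `labels`.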