AlBallaa Safa, AlTwairesh Nora, AlSalman Abdulmalik, Alfarhood Sultan
Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
Research Chair of Online Dialogue and Cultural Communication, Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.
PLoS One. 2025 Sep 2;20(9):e0329129. doi: 10.1371/journal.pone.0329129. eCollection 2025.
The evolution of Large Language Models (LLMs) has significantly advanced artificial intelligence, driving innovation across various applications. Their continued development relies on a deep understanding of their capabilities and limitations, achieved primarily through rigorous evaluation on diverse datasets. However, assessing state-of-the-art models in Arabic remains a formidable challenge due to the scarcity of comprehensive benchmarks. The absence of robust evaluation tools hinders the progress and refinement of Arabic LLMs and limits their potential applications and effectiveness in real-world scenarios. In response, we introduce GATmath (7k questions) and GATLc (9k questions), two large-scale, multitask Arabic benchmarks for reasoning and language understanding. Derived from the General Aptitude Test (GAT) examination, each dataset covers multiple categories that demand skills in reasoning, semantic analysis, language comprehension, and mathematical problem-solving. To the best of our knowledge, these are the first comprehensive, large-scale reasoning datasets specifically tailored to the Arabic language. We conducted a comprehensive evaluation and analysis of seven prominent LLMs on our datasets. Remarkably, even the highest-performing model attained only 66.9% and 64.3% accuracy on the two benchmarks, underscoring the considerable challenge they pose. This outcome illustrates the intricate nature of the tasks within our datasets and highlights the substantial room for improvement in Arabic language model development.
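The accuracy figures reported above come from scoring model predictions on multiple-choice GAT items against gold answer keys. A minimal sketch of that scoring step is below; the function name and the toy A-D answer keys are illustrative assumptions, not the authors' evaluation code.

```python
# Hedged sketch: computing multiple-choice accuracy, the metric used
# to report the 66.9% / 64.3% results. Item format (A-D keys) is an
# assumption for illustration only.

def accuracy(predictions, gold):
    """Fraction of items where the predicted choice matches the gold key."""
    assert len(predictions) == len(gold) and gold, "need equal, non-empty lists"
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy example: five 4-way multiple-choice items keyed A-D.
gold = ["A", "C", "B", "D", "A"]
preds = ["A", "C", "D", "D", "B"]
print(f"accuracy = {accuracy(preds, gold):.1%}")  # → 60.0%
```

In practice an evaluation harness would also normalize the model's free-text output to a single choice letter before scoring; that extraction step is benchmark-specific and omitted here.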