GATmath and GATLc: Comprehensive benchmarks for evaluating Arabic large language models.

Authors

AlBallaa Safa, AlTwairesh Nora, AlSalman Abdulmalik, Alfarhood Sultan

Affiliations

Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.

Research Chair of Online Dialogue and Cultural Communication, Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia.

Publication information

PLoS One. 2025 Sep 2;20(9):e0329129. doi: 10.1371/journal.pone.0329129. eCollection 2025.

DOI: 10.1371/journal.pone.0329129
PMID: 40892946
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12404542/

Abstract

The evolution of Large Language Models (LLMs) has significantly advanced artificial intelligence, driving innovation across various applications. Their continued development relies on a deep understanding of their capabilities and limitations, which is achieved primarily through rigorous evaluation on diverse datasets. However, assessing state-of-the-art models in Arabic remains a formidable challenge due to the scarcity of comprehensive benchmarks. The absence of robust evaluation tools hinders the progress and refinement of Arabic LLMs and limits their potential applications and effectiveness in real-world scenarios. In response, we introduce GATmath (7k questions) and GATLc (9k questions), two large-scale, multitask Arabic benchmarks for reasoning and language understanding. Derived from the General Aptitude Test (GAT) examination, each dataset covers multiple categories, demanding skills in reasoning, semantic analysis, language comprehension, and mathematical problem-solving. To the best of our knowledge, these are the first comprehensive, large-scale reasoning datasets specifically tailored to the Arabic language. We conducted a comprehensive evaluation and analysis of seven prominent LLMs on our datasets. Remarkably, even the highest-performing model reached only 66.9% and 64.3% accuracy, underscoring the considerable challenge posed by our datasets. This outcome illustrates the intricate nature of the tasks within our datasets and highlights the substantial room for improvement in Arabic language model development.
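
The abstract reports accuracy for models evaluated on datasets that span multiple categories. As a rough illustration of how such scores are commonly computed, the sketch below tallies per-category and overall accuracy from a list of graded answers. It is not the authors' evaluation code: the record keys (`category`, `gold`, `pred`), the category names, and the letter-style answers are all hypothetical assumptions, and the toy records are invented rather than drawn from GATmath or GATLc.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} plus an "overall" entry.

    Each record is a dict with hypothetical keys (assumed, not from the paper):
    "category", "gold" (the correct answer), and "pred" (the model's answer).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        # Count every record once under its own category and once under "overall".
        for key in (r["category"], "overall"):
            total[key] += 1
            correct[key] += int(r["pred"] == r["gold"])
    return {key: correct[key] / total[key] for key in total}


# Toy records invented purely for illustration (not taken from GATmath or GATLc).
records = [
    {"category": "algebra", "gold": "A", "pred": "A"},
    {"category": "algebra", "gold": "C", "pred": "B"},
    {"category": "analogy", "gold": "D", "pred": "D"},
    {"category": "analogy", "gold": "B", "pred": "B"},
]
scores = accuracy_by_category(records)
print(scores)
```

Reporting the per-category breakdown alongside the overall figure helps show whether a model's errors cluster in particular skills. Note that the sketch reserves the key "overall", so it assumes no real category carries that name.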

Figures

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b2e/12404542/31246ba0e553/pone.0329129.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b2e/12404542/7725ce5521bc/pone.0329129.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b2e/12404542/92fa06671105/pone.0329129.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b2e/12404542/67f85af2f2f3/pone.0329129.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b2e/12404542/69b4533d1221/pone.0329129.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b2e/12404542/7ab9880caf68/pone.0329129.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b2e/12404542/10a91b28da57/pone.0329129.g007.jpg
