

Assessing the Quality, Usefulness, and Reliability of Large Language Models (ChatGPT, DeepSeek, and Gemini) in Answering General Questions Regarding Dyslexia and Dyscalculia.

Author Information

Alrubaian Abdullah

Affiliation

Department of Special Education, College of Education, Qassim University, Buraydah, Saudi Arabia.

Publication Information

Psychiatr Q. 2025 Jun 12. doi: 10.1007/s11126-025-10170-6.

Abstract

The current study aimed to evaluate the quality, usefulness, and reliability of three large language models (LLMs), ChatGPT-4, DeepSeek, and Gemini, in answering general questions about specific learning disorders (SLDs), specifically dyslexia and dyscalculia. For each SLD subtype, 15 questions were developed through expert review of social media and forums, together with professional input. Responses from the LLMs were rated for quality using the Global Quality Scale (GQS), and for usefulness and reliability on a seven-point Likert scale. Statistical analyses, including descriptive statistics and one-way ANOVA, were conducted to compare model performance. Results revealed no statistically significant differences in quality or usefulness across models for either disorder. However, ChatGPT-4 demonstrated superior reliability for dyscalculia (p < 0.05), outperforming Gemini and DeepSeek. For dyslexia, DeepSeek achieved the maximum reliability score on 100% of questions, while ChatGPT-4 and Gemini did so on 60%. All models provided high-quality responses, with mean GQS scores ranging from 4.20 to 4.60 for dyslexia and from 3.93 to 4.53 for dyscalculia, although their practical utility varied. While LLMs show promise in delivering dyslexia- and dyscalculia-related information, ChatGPT-4's reliability for dyscalculia highlights its potential as a supplementary educational tool. Further validation by professionals remains critical.
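To make the comparison procedure concrete, the sketch below mirrors the analysis the abstract describes: per-question ratings are summarized with descriptive statistics, then compared across models with a one-way ANOVA. The rating values are hypothetical placeholders (the paper's raw scores are not published here), and the use of scipy is an assumption for illustration, not the authors' actual toolchain.

```python
import numpy as np
from scipy import stats

# Hypothetical GQS ratings (1-5) for the 15 dyscalculia questions per model.
# Placeholder values for illustration only; not the study's raw data.
gqs = {
    "ChatGPT-4": [5, 4, 5, 4, 5, 4, 5, 5, 4, 5, 4, 5, 4, 5, 4],
    "DeepSeek":  [4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 3, 4, 5, 4, 4],
    "Gemini":    [4, 5, 4, 4, 3, 4, 5, 4, 4, 3, 4, 4, 5, 4, 4],
}

# Descriptive statistics, mirroring the study's first analysis step.
for model, scores in gqs.items():
    print(f"{model}: mean={np.mean(scores):.2f}, SD={np.std(scores, ddof=1):.2f}")

# One-way ANOVA across the three models; p < 0.05 would indicate that at
# least one model's mean rating differs significantly from the others.
f_stat, p_value = stats.f_oneway(*gqs.values())
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```

The same procedure would apply to the seven-point usefulness and reliability ratings; a significant omnibus F-test would then be followed by pairwise comparisons to identify which model differs, as in the reported ChatGPT-4 versus Gemini/DeepSeek result for dyscalculia.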

