Assessing the Quality, Usefulness, and Reliability of Large Language Models (ChatGPT, DeepSeek, and Gemini) in Answering General Questions Regarding Dyslexia and Dyscalculia.

Author information

Alrubaian Abdullah

Affiliation

Department of Special Education, College of Education, Qassim University, Buraydah, Saudi Arabia.

Publication information

Psychiatr Q. 2025 Jun 12. doi: 10.1007/s11126-025-10170-6.

DOI:10.1007/s11126-025-10170-6

PMID:40504301

Abstract

The current study aimed to evaluate the quality, usefulness, and reliability of three large language models (LLMs), ChatGPT-4, DeepSeek, and Gemini, in answering general questions about specific learning disorders (SLDs), specifically dyslexia and dyscalculia. For each disorder, 15 questions were developed through expert review of social media and forums, combined with professional input. Responses from the LLMs were evaluated using the Global Quality Scale (GQS) for quality and a seven-point Likert scale for usefulness and reliability. Statistical analyses, including descriptive statistics and one-way ANOVA, were conducted to compare model performance. Results revealed no statistically significant differences in quality or usefulness across models for either disorder. However, ChatGPT-4 demonstrated superior reliability for dyscalculia (p < 0.05), outperforming Gemini and DeepSeek. For dyslexia, DeepSeek achieved 100% maximum reliability scores, while ChatGPT-4 and Gemini scored 60%. All models provided high-quality responses, with mean GQS scores ranging from 4.20 to 4.60 for dyslexia and from 3.93 to 4.53 for dyscalculia, although their practical utility varied. While LLMs show promise in delivering information about dyslexia and dyscalculia, ChatGPT-4's reliability for dyscalculia highlights its potential as a supplementary educational tool. Further validation by professionals remains critical.
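The abstract names one-way ANOVA as the method used to compare the three models. As a rough illustration only, the sketch below computes the F statistic for that kind of comparison in plain Python. The function name `one_way_anova_f`, the group labels, and all ratings are hypothetical placeholders, not data or code from the study, and the sketch stops at the F statistic rather than deriving a p-value.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F statistic, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of ratings per model.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual ratings around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical ratings for illustration; NOT data from the study.
ratings = {
    "model_a": [5, 4, 5, 4, 5],
    "model_b": [4, 4, 3, 4, 5],
    "model_c": [3, 4, 4, 3, 4],
}

f_stat, df_b, df_w = one_way_anova_f(list(ratings.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

In practice the F statistic would be compared against the F distribution with the two degrees-of-freedom values to obtain a p-value; a statistics package is the usual route for that step.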

Similar articles

Evaluating the Accuracy and Reliability of Large Language Models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in Answering Item-Analyzed Multiple-Choice Questions on Blood Physiology.

Cureus. 2025 Apr 8;17(4):e81871. doi: 10.7759/cureus.81871. eCollection 2025 Apr.

Artificial intelligence performance in answering multiple-choice oral pathology questions: a comparative analysis.

BMC Oral Health. 2025 Apr 15;25(1):573. doi: 10.1186/s12903-025-05926-2.

Enhancing responses from large language models with role-playing prompts: a comparative study on answering frequently asked questions about total knee arthroplasty.

BMC Med Inform Decis Mak. 2025 May 23;25(1):196. doi: 10.1186/s12911-025-03024-5.

DeepSeek vs ChatGPT: a comparison study of their performance in answering prostate cancer radiotherapy questions in multiple languages.

Am J Clin Exp Urol. 2025 Apr 25;13(2):176-185. doi: 10.62347/UIAP7979. eCollection 2025.

Comparing the performance of ChatGPT 4o, DeepSeek R1, and Gemini 2 Pro in answering fixed prosthodontics questions over time.

J Prosthet Dent. 2025 May 22. doi: 10.1016/j.prosdent.2025.04.038.

Evaluating Artificial Intelligence in Patient Education: DeepSeek-V3 Versus ChatGPT-4o in Answering Common Questions on Laparoscopic Cholecystectomy.

ANZ J Surg. 2025 Jun 11. doi: 10.1111/ans.70198.

Performance of the ChatGPT-3.5, ChatGPT-4, and Google Gemini large language models in responding to dental implantology inquiries.

J Prosthet Dent. 2025 Jan 4. doi: 10.1016/j.prosdent.2024.12.016.

Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education.

J Periodontal Res. 2025 Feb;60(2):121-133. doi: 10.1111/jre.13323. Epub 2024 Jul 18.

Evaluating the reliability of the responses of large language models to keratoconus-related questions.

Clin Exp Optom. 2024 Oct 24:1-8. doi: 10.1080/08164622.2024.2419524.
