

Evaluating the Accuracy and Reliability of Large Language Models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in Answering Item-Analyzed Multiple-Choice Questions on Blood Physiology.

Author Information

Agarwal Mayank, Sharma Priyanka, Wani Pinaki

Affiliations

Physiology, All India Institute of Medical Sciences, Raebareli, IND.

Physiology, School of Medical Sciences and Research, Greater Noida, IND.

Publication Information

Cureus. 2025 Apr 8;17(4):e81871. doi: 10.7759/cureus.81871. eCollection 2025 Apr.

Abstract

Background Previous research has highlighted the potential of large language models (LLMs) in answering multiple-choice questions (MCQs) in medical physiology. However, their accuracy and reliability in specialized fields, such as blood physiology, remain underexplored. This study evaluates the performance of six free-to-use LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in solving item-analyzed MCQs on blood physiology. The findings aim to assess their suitability as educational aids. Methods This cross-sectional study at the All India Institute of Medical Sciences, Raebareli, India, involved administering a 40-item MCQ test on blood physiology to 75 first-year medical students. Item analysis utilized the Difficulty Index (DIF I), Discrimination Index (DI), and Distractor Effectiveness (DE). Internal consistency was assessed with the Kuder-Richardson 20 (KR-20) coefficient. These 40 item-analyzed MCQs were presented to six selected LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok, Le Chat) available as standalone Android applications on March 19, 2025. Three independent users accessed each LLM simultaneously, uploading the compiled MCQs in a Portable Document Format (PDF) file. Accuracy was determined as the percentage of correct responses averaged across all three users. Reliability was measured as the percentage of MCQs consistently answered correctly by LLM to all three users. Descriptive statistics were presented as mean ± standard deviation and percentages. Pearson's correlation coefficient or Spearman's rho was used to evaluate the associations between variables, with p < 0.05 considered significant. Results Item analysis confirmed the validity and reliability of the assessment tool, with a DIF I of 63.2 ± 20.4, a DI of 0.38 ± 0.20, a DE of 66.7 ± 33.3, and a KR-20 of 0.804. 
Among LLMs, Claude 3.7 demonstrated the highest reliable accuracy (95%), followed by DeepSeek (93%), Grok 3 beta (93%), ChatGPT (90%), Gemini 2.0 (88%), and Mistral Le Chat (70%). No significant correlations were found between LLM performance and MCQ difficulty, discrimination power, or distractor effectiveness. Conclusions The MCQ assessment tool exhibited an appropriate difficulty level, strong discriminatory power, and adequately constructed distractors. LLMs, particularly Claude, DeepSeek, and Grok, demonstrated high accuracy and reliability in solving blood physiology MCQs, supporting their role as supplementary educational tools. LLMs handled questions of varying difficulty, discrimination power, and distractor effectiveness with similar competence. However, given occasional errors, they should be used alongside traditional teaching methods and expert supervision.
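The item-analysis statistics named in the Methods (DIF I, DI, KR-20) follow standard psychometric formulas. The sketch below shows how they could be computed from a binary scored-response matrix; the function and variable names are illustrative, not taken from the study, and Distractor Effectiveness is omitted because it requires option-level (not binary) response data.

```python
import numpy as np

def item_analysis(responses):
    """Compute DIF I, DI, and KR-20 from a binary matrix.

    responses: (n_students, n_items) array, 1 = correct, 0 = incorrect.
    """
    n_students, n_items = responses.shape

    # Difficulty Index (DIF I): percentage of students answering each item correctly
    dif_i = responses.mean(axis=0) * 100

    # Discrimination Index (DI): proportion correct in the top 27% of scorers
    # minus the proportion correct in the bottom 27%, per item
    totals = responses.sum(axis=1)
    order = np.argsort(totals)           # ascending by total score
    g = max(1, round(0.27 * n_students))
    lower, upper = responses[order[:g]], responses[order[-g:]]
    di = upper.mean(axis=0) - lower.mean(axis=0)

    # Kuder-Richardson 20 (KR-20): internal consistency for dichotomous items
    p = responses.mean(axis=0)           # item difficulty as a proportion
    q = 1 - p
    var_total = totals.var(ddof=1)       # sample variance of total scores
    kr20 = (n_items / (n_items - 1)) * (1 - (p * q).sum() / var_total)

    return dif_i, di, kr20
```

Under this sketch, a KR-20 of 0.804 (as reported) would indicate good internal consistency, since values above 0.7 are conventionally considered acceptable for classroom assessments.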
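The accuracy and reliability definitions in the Methods (percent correct averaged over users, and percent of MCQs answered correctly for all three users) reduce to simple aggregation. A minimal sketch, assuming each user's LLM responses are recorded as a list of chosen options; all names here are hypothetical:

```python
from statistics import mean

def llm_accuracy_and_reliability(answers, key):
    """answers: dict mapping user id -> list of LLM-chosen options.
    key: list of correct options, same length and order for every user.

    Accuracy: percent correct, averaged across users.
    Reliability: percent of items every user's session got right.
    """
    # Per-user correctness vectors (True where the response matches the key)
    per_user = [[a == k for a, k in zip(resp, key)] for resp in answers.values()]

    accuracy = mean(sum(u) / len(key) for u in per_user) * 100

    # An item counts as reliable only if all users received the correct answer
    consistent = sum(all(vals) for vals in zip(*per_user))
    reliability = consistent / len(key) * 100

    return accuracy, reliability
```

By this definition reliability can only be less than or equal to the lowest single-user accuracy, which is why the paper reports "reliable accuracy" as a stricter combined metric.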

