
Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders.

Authors

Bolgova Olena, Ganguly Paul, Ikram Muhammad Faisal, Mavrych Volodymyr

Affiliation

College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia.

Publication

Med Educ Online. 2025 Dec;30(1):2550751. doi: 10.1080/10872981.2025.2550751. Epub 2025 Aug 24.

DOI: 10.1080/10872981.2025.2550751
PMID: 40849930
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12377152/
Abstract

The assessment of short-answer questions (SAQs) in medical education is resource-intensive, requiring significant expert time. Large Language Models (LLMs) offer potential for automating this process, but their efficacy in specialized medical education assessment remains understudied. To evaluate the capability of five LLMs to grade medical SAQs compared to expert human graders across four distinct medical disciplines. This study analyzed 804 student responses across anatomy, histology, embryology, and physiology. Three faculty members graded all responses. Five LLMs (GPT-4.1, Gemini, Claude, Copilot, DeepSeek) evaluated responses twice: first using their learned representations to generate their own grading criteria (A1), then using expert-provided rubrics (A2). Agreement was measured using Cohen's Kappa and Intraclass Correlation Coefficient (ICC). Expert-expert agreement was substantial across all questions (average Kappa: 0.69, ICC: 0.86), ranging from moderate (SAQ2: 0.57) to almost perfect (SAQ4: 0.87). LLM performance varied dramatically by question type and model. The highest expert-LLM agreement was observed for Claude on SAQ3 (Kappa: 0.61) and DeepSeek on SAQ2 (Kappa: 0.53). Providing expert criteria had inconsistent effects, significantly improving some model-question combinations while decreasing others. No single LLM consistently outperformed others across all domains. LLM strictness in grading unsatisfactory responses varied substantially from experts. LLMs demonstrated domain-specific variations in grading capabilities. The provision of expert criteria did not consistently improve performance. While LLMs show promise for supporting medical education assessment, their implementation requires domain-specific considerations and continued human oversight.
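For concreteness, below is a speculative sketch of the two-pass evaluation the abstract describes: each model first grades a response using criteria it generates itself (A1), then re-grades the same response using the expert-provided rubric (A2). The prompt wording, the 0-2 grade scale, and the call_llm helper are assumptions invented for illustration, not taken from the paper.

```python
# Speculative sketch of the A1/A2 two-pass grading protocol described in
# the abstract. `call_llm` is a hypothetical stand-in for any chat API.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError

A1_PROMPT = (
    "You are grading a medical short-answer question.\n"
    "Question: {question}\n"
    "Student answer: {answer}\n"
    "Devise your own grading criteria, then return a grade of 0, 1, or 2."
)

A2_PROMPT = (
    "You are grading a medical short-answer question.\n"
    "Question: {question}\n"
    "Expert rubric: {rubric}\n"
    "Student answer: {answer}\n"
    "Apply the rubric strictly, then return a grade of 0, 1, or 2."
)

def grade_twice(question: str, answer: str, rubric: str) -> tuple[str, str]:
    # Pass A1: the model grades with self-generated criteria.
    a1 = call_llm(A1_PROMPT.format(question=question, answer=answer))
    # Pass A2: the model grades with the expert-provided rubric.
    a2 = call_llm(A2_PROMPT.format(question=question, answer=answer, rubric=rubric))
    return a1, a2
```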

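Agreement in the study is quantified with Cohen's Kappa and the Intraclass Correlation Coefficient. As a minimal, self-contained illustration (not the authors' analysis code), the sketch below computes unweighted Cohen's Kappa between an expert grader and an LLM grader; the 0/1/2 grade scale and the data are made up for the example.

```python
# A minimal sketch: unweighted Cohen's Kappa between two graders'
# categorical grades on the same set of responses.
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two equal-length lists of grades."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of responses graded identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence, from marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical grades (0 = unsatisfactory, 1 = partial, 2 = full credit).
expert = [2, 1, 0, 2, 2, 1, 0, 1]
llm    = [2, 1, 1, 2, 2, 0, 0, 1]
print(round(cohen_kappa(expert, llm), 2))  # 0.62
```

On the commonly used Landis and Koch scale, which matches the abstract's wording, 0.41-0.60 is read as moderate agreement, 0.61-0.80 as substantial, and 0.81-1.00 as almost perfect.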

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d15f/12377152/082a8b248439/ZMEO_A_2550751_F0001_OC.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d15f/12377152/8ef0b57256a3/ZMEO_A_2550751_F0002_OC.jpg

Similar Articles

1. Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders. Med Educ Online. 2025 Dec;30(1):2550751. doi: 10.1080/10872981.2025.2550751. Epub 2025 Aug 24.
2. Large language models in medical education: a comparative cross-platform evaluation in answering histological questions. Med Educ Online. 2025 Dec;30(1):2534065. doi: 10.1080/10872981.2025.2534065. Epub 2025 Jul 12.
3. Comparative analysis of LLMs performance in medical embryology: A cross-platform study of ChatGPT, Claude, Gemini, and Copilot. Anat Sci Educ. 2025 May 11. doi: 10.1002/ase.70044.
4. Large language models (LLMs) in radiology exams for medical students: Performance and consequences. Rofo. 2024 Nov 4. doi: 10.1055/a-2437-2067.
5. A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection. BMC Med Educ. 2025 Jul 1;25(1):903. doi: 10.1186/s12909-025-07493-0.
6. Performance of Large Language Models in Numerical Versus Semantic Medical Knowledge: Cross-Sectional Benchmarking Study on Evidence-Based Questions and Answers. J Med Internet Res. 2025 Jul 14;27:e64452. doi: 10.2196/64452.
7. A multi-dimensional performance evaluation of large language models in dental implantology: comparison of ChatGPT, DeepSeek, Grok, Gemini and Qwen across diverse clinical scenarios. BMC Oral Health. 2025 Jul 28;25(1):1272. doi: 10.1186/s12903-025-06619-6.
8. Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study. JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
9. Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions. J Surg Educ. 2025 Apr;82(4):103442. doi: 10.1016/j.jsurg.2025.103442. Epub 2025 Feb 9.
10. Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents. BMC Med Educ. 2025 Aug 23;25(1):1193. doi: 10.1186/s12909-025-07856-7.
