Bolgova Olena, Ganguly Paul, Ikram Muhammad Faisal, Mavrych Volodymyr
College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia.
Med Educ Online. 2025 Dec;30(1):2550751. doi: 10.1080/10872981.2025.2550751. Epub 2025 Aug 24.
Background: The assessment of short-answer questions (SAQs) in medical education is resource-intensive, requiring significant expert time. Large language models (LLMs) offer potential for automating this process, but their efficacy in specialized medical education assessment remains understudied.
Objective: To evaluate the capability of five LLMs to grade medical SAQs compared with expert human graders across four distinct medical disciplines.
Methods: This study analyzed 804 student responses across anatomy, histology, embryology, and physiology. Three faculty members graded all responses. Five LLMs (GPT-4.1, Gemini, Claude, Copilot, DeepSeek) evaluated the responses twice: first using their learned representations to generate their own grading criteria (A1), then using expert-provided rubrics (A2). Agreement was measured using Cohen's Kappa and the Intraclass Correlation Coefficient (ICC).
Results: Expert-expert agreement was substantial across all questions (average Kappa: 0.69; ICC: 0.86), ranging from moderate (SAQ2: 0.57) to almost perfect (SAQ4: 0.87). LLM performance varied markedly by question type and model. The highest expert-LLM agreement was observed for Claude on SAQ3 (Kappa: 0.61) and DeepSeek on SAQ2 (Kappa: 0.53). Providing expert criteria had inconsistent effects, significantly improving some model-question combinations while decreasing others. No single LLM consistently outperformed the others across all domains. LLM strictness in grading unsatisfactory responses also differed substantially from that of the experts.
Conclusions: LLMs demonstrated domain-specific variations in grading capability, and the provision of expert criteria did not consistently improve performance. While LLMs show promise for supporting medical education assessment, their implementation requires domain-specific considerations and continued human oversight.
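For readers unfamiliar with the agreement statistics named in the Methods, the following is a minimal illustrative sketch of how expert-LLM agreement might be computed with Cohen's Kappa and the ICC. The grade values, number of responses, and rater labels are hypothetical and not taken from the study; the sketch assumes the widely used scikit-learn and pingouin packages.

```python
# Illustrative sketch only: computing chance-corrected agreement (Cohen's Kappa)
# and an intraclass correlation coefficient (ICC) between one expert grader and
# one LLM grader. All data below are hypothetical.
import pandas as pd
from sklearn.metrics import cohen_kappa_score
import pingouin as pg

# Hypothetical ordinal SAQ grades (0-3 points) for the same eight student responses
expert_grades = [3, 2, 0, 1, 3, 2, 1, 0]
llm_grades    = [3, 1, 0, 1, 2, 2, 1, 1]

# Cohen's Kappa: agreement between the two graders, corrected for chance
kappa = cohen_kappa_score(expert_grades, llm_grades)
print(f"Cohen's Kappa: {kappa:.2f}")

# ICC: pingouin expects long format, one row per (response, rater) pair
long = pd.DataFrame({
    "response": list(range(len(expert_grades))) * 2,
    "rater": ["expert"] * len(expert_grades) + ["llm"] * len(llm_grades),
    "score": expert_grades + llm_grades,
})
icc = pg.intraclass_corr(data=long, targets="response", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])
```

In practice the choice of ICC form (e.g., absolute agreement vs. consistency) and the handling of multiple raters would follow the study design; the table returned by pingouin lists the common variants so the appropriate one can be selected.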