Krumsvik Rune Johan
Department of Education, University of Bergen, Bergen, Norway.
Front Med (Lausanne). 2025 Apr 10;12:1441747. doi: 10.3389/fmed.2025.1441747. eCollection 2025.
The growing integration of artificial intelligence (AI) in education has paved the way for innovative assessment methods. This study explores the capabilities of GPT-4, a large language model (LLM), on a medical exam and in formative and summative assessments in Norwegian educational settings. The research builds on our previous work by evaluating GPT-4's performance on a full-scale medical multiple-choice exam to explore how AI can enhance assessment practices. Prior studies have shown that LLMs hold potential in medical education but have not specifically examined how GPT-4 can enhance formative and summative assessment. This study therefore contributes to filling that gap by examining GPT-4's capabilities for formative and summative assessment in medical education in Norway. For this purpose, a case study design was employed, and the primary data sources were 110 exam questions, 10 blinded exam questions, and two patient cases in medicine. The findings from this intrinsic case study revealed that GPT-4 performed well on the summative assessment, handling Norwegian medical language robustly. Further, GPT-4 evaluated comprehensive student exam responses, such as patient cases, reliably and aligned closely with human assessments. The findings suggest that GPT-4 can improve formative assessment by providing timely, personalized feedback to support student learning. The study highlights the importance of an empirical and theoretical understanding of the gap between traditional assessment methods and AI-enhanced approaches, particularly the role of chain-of-thought prompting and how AI can scaffold tutoring and assessment practices. However, continuous refinement and human oversight remain crucial to ensure the effective and responsible integration of LLMs like GPT-4 into educational settings.
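To illustrate the kind of workflow the abstract describes, the sketch below shows how chain-of-thought prompting might be used with GPT-4 to grade a short exam answer and return formative feedback. It is a minimal, hedged example using the OpenAI Python SDK; the rubric text, prompt wording, grading scale, and the `grade_short_answer` helper are illustrative assumptions, not the study's actual prompts or procedure.

```python
# Hedged sketch: chain-of-thought style grading with GPT-4 via the OpenAI
# Python SDK (>=1.0). Prompt wording, rubric, and grading scale are assumed
# for illustration; the study's real prompts are not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def grade_short_answer(question: str, student_answer: str, rubric: str) -> str:
    """Ask the model to reason step by step before scoring and giving feedback."""
    prompt = (
        "You are assisting with formative assessment in a medical course.\n"
        f"Question:\n{question}\n\n"
        f"Student answer:\n{student_answer}\n\n"
        f"Rubric:\n{rubric}\n\n"
        "Think through the answer step by step (chain of thought), then give a "
        "score out of 10 and two sentences of personalized feedback."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for more consistent grading
    )
    return response.choices[0].message.content


# Hypothetical usage with an invented exam item:
# print(grade_short_answer(
#     "Name a guideline-recommended first-line drug class for uncomplicated hypertension.",
#     "I would start with a thiazide diuretic or an ACE inhibitor.",
#     "Full marks for naming a recommended first-line drug class with brief justification.",
# ))
```

Setting the temperature to zero is one common way to make repeated grading runs more consistent, which matters when LLM output is compared against human assessments as in this study.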