Tekin Murat, Yurdal Mustafa Onur, Toraman Çetin, Korkmaz Güneş, Uysal İbrahim
Medical Education, Çanakkale Onsekiz Mart University, Çanakkale, Turkey.
Medical Education, İstanbul Medeniyet University, İstanbul, Turkey.
BMC Med Educ. 2025 May 1;25(1):641. doi: 10.1186/s12909-025-07241-4.
Objective Structured Clinical Examinations (OSCEs) are widely used in medical education to assess students' clinical and professional skills. Recent advancements in artificial intelligence (AI) offer opportunities to complement human evaluations. This study aims to explore the consistency between human and AI evaluators in assessing medical students' clinical skills during OSCEs.
This cross-sectional study was conducted at a state university in Turkey, focusing on pre-clinical medical students (Years 1, 2, and 3). Four clinical skills (intramuscular injection, square knot tying, basic life support, and urinary catheterization) were evaluated during an OSCE at the end of the 2023-2024 academic year. Video recordings of the students' performances were assessed by five evaluators: a real-time human assessor, two video-based expert human assessors, and two AI-based systems (ChatGPT-4o and Gemini Flash 1.5). The evaluations were based on standardized checklists validated by the university. Data were collected from 196 students, with sample sizes ranging from 43 to 58 per skill. Consistency among evaluators was analyzed statistically.
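As a minimal sketch of how consistency across the five evaluators might be quantified, assuming scores are arranged in long format with one row per student-rater pair: the abstract does not name the statistic used, so an intraclass correlation coefficient computed with the pingouin package is shown purely as an illustration, and all data values below are hypothetical.

import pandas as pd
import pingouin as pg

# Hypothetical long-format scores: one row per (student, rater) pair
raters = ["live", "video_expert_1", "video_expert_2", "chatgpt_4o", "gemini_1_5"]
scores = pd.DataFrame({
    "student": sorted([1, 2, 3, 4] * 5),
    "rater": raters * 4,
    "total": [26, 25, 27, 29, 28,
              24, 23, 25, 27, 28,
              22, 21, 23, 26, 25,
              27, 26, 28, 30, 29],
})

# Two-way random-effects ICC across the five evaluators (illustrative choice)
icc = pg.intraclass_corr(data=scores, targets="student",
                         raters="rater", ratings="total")
print(icc[["Type", "ICC", "CI95%"]])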
AI models consistently assigned higher scores than human evaluators across all skills. For intramuscular injection, the mean total score given by AI was 28.23, while human evaluators averaged 25.25. For knot tying, AI scores averaged 16.07 versus 10.44 for humans. In basic life support, AI scores averaged 17.05 versus 16.48 for humans. For urinary catheterization, mean scores were similar (AI: 26.68; humans: 27.02), but scores on individual criteria varied considerably. Inter-rater consistency was higher for visually observable steps, while auditory tasks led to greater discrepancies between AI and human evaluators.
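A similarly hedged sketch of how the AI-versus-human gap for a single skill could be examined per student, assuming each student's two AI scores and three human scores have been averaged beforehand: the paired t-test and the numbers below are illustrative assumptions, not the study's reported analysis.

import numpy as np
from scipy import stats

# Hypothetical per-student mean totals for square knot tying
ai_mean = np.array([16.5, 15.8, 17.0, 16.2, 15.9])    # mean of ChatGPT-4o and Gemini Flash 1.5
human_mean = np.array([10.1, 11.0, 10.6, 9.8, 10.7])  # mean of the three human assessors

# Mean difference and a paired comparison (illustrative)
diff = ai_mean - human_mean
t_stat, p_value = stats.ttest_rel(ai_mean, human_mean)
print(f"Mean AI-human difference: {diff.mean():.2f} points")
print(f"Paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")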
AI shows promise as a supplemental tool for OSCE evaluation, especially for visually based clinical skills. However, its reliability varies depending on the perceptual demands of the skill being assessed. The higher and more uniform scores given by AI suggest potential for standardization, yet refinement is needed for accurate assessment of skills requiring verbal communication or auditory cues.