Liu Hao-Yun, Chen Shyh-Jye, Wang Weichung, Lee Chung-Hsi, Hsu Hsian-He, Shen Shu-Huei, Chiou Hong-Jen, Lee Wen-Jeng
Department of Medical Imaging, National Taiwan University Hospital Hsin-Chu Branch, No. 25, Lane 442, Sec. 1, Jingguo Rd., Hsinchu City 300, Taiwan ROC (H.Y.L.); National Taiwan University College of Medicine, No. 1, Sec. 1, Jen Ai Rd., Taipei 100, Taiwan ROC (H.Y.L.).
Department of Medical Imaging, National Taiwan University Hospital, No. 7 Zhongshan S. Rd., Zhongzheng Dist., Taipei City 100229, Taiwan (S.J.C., W.J.L.).
Acad Radiol. 2025 May 27. doi: 10.1016/j.acra.2025.05.023.
The radiology specialty examination assesses clinical decision-making, image interpretation, and diagnostic reasoning. As medical knowledge expands, traditional test design faces challenges in maintaining accuracy and relevance. Large language models (LLMs) have demonstrated potential in medical education. This study evaluates the performance of LLMs on a radiology specialty examination, explores their role in assessing question difficulty, and investigates their reasoning processes, with the aim of developing a more objective and efficient framework for exam design.
This study compared the performance of LLMs and human examinees on a radiology specialty examination. Three LLMs (GPT-4o, o1-preview, and GPT-3.5-turbo-1106) were evaluated under zero-shot conditions. Exam accuracy, examinee accuracy, the discrimination index, and the point-biserial correlation were used to assess the LLMs' ability to predict question difficulty and to characterize their reasoning processes. Examinee performance data provided by the Taiwan Radiological Society ensured comparability between AI and human performance.
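For context, the discrimination index and point-biserial correlation used above are standard item-analysis statistics. The sketch below is illustrative only (the data are simulated and the function names are not from the study); it assumes a binary item-score vector and the examinees' total exam scores.

```python
import numpy as np

def discrimination_index(item_correct, total_scores, group_frac=0.27):
    """Classical discrimination index: proportion correct in the top-scoring
    group minus the proportion correct in the bottom-scoring group."""
    n = len(total_scores)
    k = max(1, int(round(group_frac * n)))
    order = np.argsort(total_scores)            # ascending by total exam score
    lower, upper = order[:k], order[-k:]
    item_correct = np.asarray(item_correct, dtype=float)
    return item_correct[upper].mean() - item_correct[lower].mean()

def point_biserial(item_correct, total_scores):
    """Point-biserial correlation between a dichotomous item score (0/1)
    and each examinee's total exam score."""
    x = np.asarray(item_correct, dtype=float)
    y = np.asarray(total_scores, dtype=float)
    p = x.mean()                                 # proportion answering correctly
    if p in (0.0, 1.0):
        return 0.0                               # undefined when all answers agree
    mean_correct = y[x == 1].mean()
    mean_incorrect = y[x == 0].mean()
    return (mean_correct - mean_incorrect) / y.std() * np.sqrt(p * (1.0 - p))

# Hypothetical example: 200 examinees, one exam item
rng = np.random.default_rng(0)
totals = rng.normal(76, 8, size=200)             # simulated total scores (%)
item = (totals + rng.normal(0, 10, 200) > 74).astype(int)
print(discrimination_index(item, totals), point_biserial(item, totals))
```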
In terms of accuracy, GPT-4o (88.0%) and o1-preview (90.9%) outperformed human examinees (76.3%), whereas GPT-3.5-turbo-1106 showed significantly lower accuracy (50.2%). Question difficulty analysis revealed that the newer LLMs excelled at complex questions, while GPT-3.5-turbo-1106 exhibited greater performance variability. Discrimination index and point-biserial correlation analyses demonstrated that GPT-4o and o1-preview accurately identified key differentiating questions, closely mirroring human reasoning patterns. These findings suggest that advanced LLMs can assess medical examination difficulty, offering potential applications in exam standardization and question evaluation.
This study evaluated the problem-solving capabilities of GPT-3.5-turbo-1106, GPT-4o, and o1-preview on a radiology specialty examination. The results support using LLMs as tools for assessing exam question difficulty and for assisting in the standardized development of medical examinations.