Department of Radiology, Kahramanmaraş Necip Fazıl City Hospital (Kahramanmaraş Necip Fazıl Şehir Hastanesi), Kahramanmaraş 46050, Turkey (M.S.B.).
Department of Radiology, Hospital Clínic de Barcelona, C. de Villarroel, 170, Barcelona 08036, Spain (L.O.).
Acad Radiol. 2024 Nov;31(11):4365-4371. doi: 10.1016/j.acra.2024.09.005. Epub 2024 Sep 18.
RATIONALE AND OBJECTIVES: This study aims to evaluate the performance of generative pre-trained transformer (GPT)-4o on the complete official European Board of Radiology (EBR) exam, which is designed to assess radiology knowledge, skills, and competence.

MATERIALS AND METHODS: Text-, image-, and video-based questions in multiple-choice, free-text reporting, and image-annotation formats were uploaded to GPT-4o using standardized prompting. The results were compared with the average scores of radiologists taking the exam in real time.

RESULTS: In Part 1 (multiple response questions and short cases), GPT-4o outperformed both the radiologists' average score and the pass threshold (70.2% vs. 58.4% and 60%, respectively). In Part 2 (clinically oriented reasoning evaluation), GPT-4o scored below both the radiologists' average score and the pass threshold (52.9% vs. 66.1% and 55%, respectively). Accuracy was higher on questions involving ultrasound images than on other imaging modalities (87.5-100%). On video-based questions, accuracy was 50.6%. The model achieved its highest accuracy on most-likely-diagnosis questions but performed worse on free-text reporting and direct anatomical assessment of images (100% vs. 31% and 28.6%, respectively).

CONCLUSION: The abilities of GPT-4o on the official EBR exam are particularly noteworthy. This study demonstrates the potential of large language models to assist radiologists in assessing and managing cases from diagnosis to treatment or follow-up recommendations, even with zero-shot prompting.
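The methods describe uploading text-, image-, and video-based exam questions to GPT-4o with standardized zero-shot prompting. The paper does not publish its exact prompts or pipeline; below is a minimal sketch of how a single image-based multiple-choice question could be submitted through the OpenAI Chat Completions API. The prompt wording, function name, file path, and question text are illustrative assumptions, not the authors' actual protocol.

```python
# Illustrative sketch only: sending one image-based multiple-choice question to
# GPT-4o with a zero-shot prompt via the OpenAI Chat Completions API.
# Prompt wording, file path, and question text are assumptions, not the study's protocol.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_image_question(image_path: str, question: str) -> str:
    # Encode the exam image as a base64 data URL so it can be passed inline.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "You are taking a radiology board exam. "
                                "Answer with the single best option.\n\n" + question,
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
        temperature=0,  # deterministic, exam-style answering
    )
    return response.choices[0].message.content


# Hypothetical usage:
# print(ask_image_question("case01.png",
#       "Which is the most likely diagnosis? A) ... B) ... C) ... D) ..."))
```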