College of Medicine, University of Saskatchewan, Saskatoon, Saskatchewan, Canada (N.P.M., H.S., H.O., S.J.A.); Department of Medical Imaging, Royal University Hospital, Saskatoon, Saskatchewan, Canada (N.P.M., H.O., S.J.A.).
Acad Radiol. 2024 Sep;31(9):3872-3878. doi: 10.1016/j.acra.2024.06.046. Epub 2024 Jul 15.
To determine the potential of large language models (LLMs) as tools for radiology educators to create radiology board-style multiple choice questions (MCQs), answers, and rationales.
Two LLMs (Llama 2 and GPT-4) were used to develop 104 MCQs based on the American Board of Radiology exam blueprint. Two board-certified radiologists assessed each MCQ on a 10-point Likert scale across five criteria: clarity, relevance, suitability for a board exam based on level of difficulty, quality of distractors, and adequacy of rationale. For comparison, MCQs from prior American College of Radiology (ACR) Diagnostic Radiology In-Training (DXIT) exams were assessed using the same criteria, with the radiologists blinded to question source.
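As an illustration of the generation step, a minimal sketch using the OpenAI Python SDK is shown below. The model name, system prompt, and requested output format are assumptions for illustration only; the abstract does not specify the prompts the authors used.

    # Minimal sketch of board-style MCQ generation with the OpenAI chat API.
    # Prompt wording and output format are assumptions, not the study's prompts.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def generate_mcq(topic: str) -> str:
        """Request one board-style MCQ on a given blueprint topic."""
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a radiology educator writing American Board "
                        "of Radiology-style multiple choice questions."
                    ),
                },
                {
                    "role": "user",
                    "content": (
                        f"Write one board-style MCQ on {topic} with four "
                        "options (A-D), the correct answer, and a brief "
                        "rationale."
                    ),
                },
            ],
        )
        return response.choices[0].message.content

    print(generate_mcq("pediatric neuroradiology"))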
Mean scores (±standard deviation) for clarity, relevance, suitability, quality of distractors, and adequacy of rationale were 8.7 (±1.4), 9.2 (±1.3), 9.0 (±1.2), 8.4 (±1.9), and 7.2 (±2.2), respectively, for Llama 2; 9.9 (±0.4), 9.9 (±0.5), 9.9 (±0.4), 9.8 (±0.5), and 9.9 (±0.3), respectively, for GPT-4; and 9.9 (±0.3), 9.9 (±0.2), 9.9 (±0.2), 9.9 (±0.4), and 9.8 (±0.6), respectively, for ACR DXIT items (p < 0.001 for Llama 2 vs. ACR DXIT across all criteria; no statistically significant difference for GPT-4 vs. ACR DXIT). The accuracy of model-generated answers was 69% for Llama 2 and 100% for GPT-4.
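For readers interested in reproducing the group comparison, the sketch below contrasts per-question Likert ratings between two question sources. The abstract does not name the statistical test used, so the Mann-Whitney U test (a common choice for ordinal rating data) is shown here as an assumption, and the rating arrays are illustrative placeholders rather than the study's data.

    # Sketch of comparing per-question Likert ratings between question sources.
    # The test choice and the ratings below are illustrative assumptions.
    import numpy as np
    from scipy.stats import mannwhitneyu

    llama2_clarity = np.array([8, 9, 7, 10, 9, 8, 7, 9])   # hypothetical
    dxit_clarity = np.array([10, 10, 9, 10, 10, 10, 9, 10])  # hypothetical

    print(f"Llama 2: {llama2_clarity.mean():.1f} "
          f"(±{llama2_clarity.std(ddof=1):.1f})")
    print(f"DXIT:    {dxit_clarity.mean():.1f} "
          f"(±{dxit_clarity.std(ddof=1):.1f})")

    stat, p = mannwhitneyu(llama2_clarity, dxit_clarity,
                           alternative="two-sided")
    print(f"Mann-Whitney U = {stat:.1f}, p = {p:.3f}")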
A state-of-the-art LLM such as GPT-4 may be used to develop radiology board-style MCQs and rationales, enhancing exam preparation materials, expanding question banks, and allowing radiology educators to further use MCQs as teaching and learning tools.