Large Language Models as Tools to Generate Radiology Board-Style Multiple-Choice Questions.

Author Information

College of Medicine, University of Saskatchewan, Saskatoon, Saskatchewan, Canada (N.P.M., H.S., H.O., S.J.A.); Department of Medical Imaging, Royal University Hospital, Saskatoon, Saskatchewan, Canada (N.P.M., H.O., S.J.A.).

Publication Information

Acad Radiol. 2024 Sep;31(9):3872-3878. doi: 10.1016/j.acra.2024.06.046. Epub 2024 Jul 15.

Abstract

RATIONALE AND OBJECTIVES

To determine the potential of large language models (LLMs) to serve as tools for radiology educators in creating radiology board-style multiple-choice questions (MCQs), answers, and rationales.

METHODS

Two LLMs (Llama 2 and GPT-4) were used to develop 104 MCQs based on the American Board of Radiology exam blueprint. Two board-certified radiologists assessed each MCQ using a 10-point Likert scale across five criteria: clarity, relevance, suitability for a board exam based on level of difficulty, quality of distractors, and adequacy of rationale. For comparison, MCQs from prior American College of Radiology (ACR) Diagnostic Radiology In-Training (DXIT) exams were also assessed using these criteria, with radiologists blinded to the question source.
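
The abstract does not include the authors' prompts or generation settings. The following is a minimal sketch of the kind of generation step described, assuming the OpenAI Python client (openai>=1.0); the prompt wording, the blueprint section name, and the requested output format are all illustrative assumptions, not the study's protocol.

```python
# Minimal sketch of generating one board-style MCQ with GPT-4, assuming
# the OpenAI Python client (openai>=1.0). The study's actual prompts and
# parameters are not reported in the abstract; everything here is
# illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

section = "Neuroradiology"  # hypothetical blueprint content area

prompt = (
    f"Write one board-style multiple-choice question for the {section} "
    "section of the American Board of Radiology exam. Provide four answer "
    "options (A-D), identify the correct answer, and give a brief rationale "
    "for the correct answer and each distractor."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

A comparable call against a locally hosted Llama 2 endpoint would follow the same pattern, with the model's output then passed to the blinded raters alongside the ACR DXIT items.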

RESULTS

Mean scores (±standard deviation) for clarity, relevance, suitability, quality of distractors, and adequacy of rationale were 8.7 (±1.4), 9.2 (±1.3), 9.0 (±1.2), 8.4 (±1.9), and 7.2 (±2.2), respectively, for Llama 2; 9.9 (±0.4), 9.9 (±0.5), 9.9 (±0.4), 9.8 (±0.5), and 9.9 (±0.3), respectively, for GPT-4; and 9.9 (±0.3), 9.9 (±0.2), 9.9 (±0.2), 9.9 (±0.4), and 9.8 (±0.6), respectively, for ACR DXIT items (p < 0.001 for Llama 2 vs. ACR DXIT across all criteria; no statistically significant difference for GPT-4 vs. ACR DXIT). The accuracy of model-generated answers was 69% for Llama 2 and 100% for GPT-4.
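
The abstract reports p-values but does not name the statistical test. For ordinal Likert ratings, a nonparametric test such as the Mann-Whitney U test is a common choice; the sketch below assumes that test via scipy.stats.mannwhitneyu and uses placeholder ratings, not the study's data.

```python
# Sketch of a per-criterion comparison between two question sources,
# assuming a Mann-Whitney U test (the abstract does not name the test).
# The ratings below are placeholders, not the study's data.
from scipy.stats import mannwhitneyu

llama2_clarity = [9, 8, 10, 7, 9, 8, 10, 9]    # hypothetical Likert ratings
dxit_clarity = [10, 10, 9, 10, 10, 10, 9, 10]  # hypothetical Likert ratings

stat, p = mannwhitneyu(llama2_clarity, dxit_clarity, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p:.4f}")
```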

CONCLUSION

A state-of-the-art LLM such as GPT-4 may be used to develop radiology board-style MCQs and rationales to enhance exam preparation materials and expand exam banks, and may allow radiology educators to further use MCQs as teaching and learning tools.

