Azrieli Faculty of Medicine, Bar-Ilan University, Ha'Hadas St. 1, Rishon Le Zion, Zefat, 7550598, Israel.
Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel.
BMC Med Educ. 2024 Mar 29;24(1):354. doi: 10.1186/s12909-024-05239-y.
BACKGROUND: Writing multiple choice questions (MCQs) for medical exams is challenging: it demands extensive medical knowledge, time, and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs.

METHODS: The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Non-English studies, studies outside the date range, and studies not focusing on AI-generated multiple-choice questions were excluded. MEDLINE was used as the search database. Risk of bias was evaluated using a tailored QUADAS-2 tool, and the review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

RESULTS: Overall, eight studies published between April 2023 and October 2023 were included. Six studies used ChatGPT-3.5, while two employed GPT-4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models; another compared LLM-generated questions with those written by humans. All studies encountered faulty questions that were deemed inappropriate for medical exams, and some questions required additional modification to qualify. Two studies were at high risk of bias.

CONCLUSIONS: LLMs can be used to write MCQs for medical examinations, but their limitations cannot be ignored. Further study in this field is essential, and more conclusive evidence is needed. Until then, LLMs may serve as a supplementary tool for writing medical examinations.
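To make concrete what LLM-based MCQ generation looks like in practice, below is a minimal Python sketch using the OpenAI chat completions API. It is an illustration only, not the prompting protocol of any study included in the review; the prompt wording, model name, and sampling temperature are assumptions introduced for this example.

```python
# Minimal sketch of generating a medical MCQ with an LLM via the OpenAI API.
# Illustrative only: prompt wording, model choice, and output format are
# assumptions, not the protocol of any study included in the review.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write one USMLE-style multiple choice question on {topic}. "
    "Provide a clinical vignette stem, five answer options labeled A-E, "
    "the single correct answer, and a one-sentence explanation."
)

def generate_mcq(topic: str, model: str = "gpt-4") -> str:
    """Ask the model for a single MCQ on the given topic."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
        temperature=0.7,  # allow some variety between generated items
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_mcq("community-acquired pneumonia"))
```

Consistent with the review's findings, items produced this way still require expert screening before use: every included study encountered at least some generated questions that were unusable without modification.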