

Large language models for generating medical examinations: systematic review.

Author information

Azrieli Faculty of Medicine, Bar-Ilan University, Ha'Hadas St. 1, Rishon Le Zion, Zefat, 7550598, Israel.

Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel.

Publication information

BMC Med Educ. 2024 Mar 29;24(1):354. doi: 10.1186/s12909-024-05239-y.

Abstract

BACKGROUND

Writing multiple-choice questions (MCQs) for medical exams is challenging: it demands extensive medical knowledge, time, and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs.
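To make the task concrete, the following is a minimal sketch of how a study might prompt an LLM to draft an MCQ, assuming the OpenAI Python SDK. The model name, prompt wording, and output format are illustrative only; the abstract does not report the prompts used in the reviewed studies.

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Illustrative prompt; the reviewed studies used their own prompts,
    # which the abstract does not report.
    prompt = (
        "Write one board-style multiple-choice question on acute "
        "pancreatitis with five options (A-E), indicate the single "
        "correct answer, and give a brief explanation."
    )

    response = client.chat.completions.create(
        model="gpt-4",  # six reviewed studies used ChatGPT-3.5, two GPT-4
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)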

METHODS

The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Non-English studies, studies outside the date range, and studies not focusing on AI-generated multiple-choice questions were excluded. MEDLINE was used as the search database. Risk of bias was evaluated using a tailored QUADAS-2 tool.
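A search of this kind can be reproduced programmatically against MEDLINE through NCBI's E-utilities, here via Biopython. The query string and date cutoff below are assumptions loosely based on the abstract; the review's exact search strategy is not reported here.

    from Bio import Entrez

    Entrez.email = "you@example.com"  # NCBI requires a contact address (placeholder)

    # Hypothetical query approximating the review's focus; the published
    # search strategy may differ.
    query = (
        '("large language model*" OR ChatGPT OR GPT) '
        'AND ("multiple choice" OR MCQ) '
        'AND (medical examination OR medical education) '
        'AND ("1900/01/01"[PDAT] : "2023/11/30"[PDAT])'
    )

    handle = Entrez.esearch(db="pubmed", term=query, retmax=200)
    record = Entrez.read(handle)
    handle.close()
    print(record["Count"], record["IdList"][:10])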

RESULTS

Overall, eight studies published between April 2023 and October 2023 were included. Six studies used ChatGPT-3.5, while two employed GPT-4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models, and another compared LLM-generated questions with those written by humans. All studies encountered faulty questions that were deemed inappropriate for medical exams, and some questions required additional modification to qualify.

CONCLUSIONS

LLMs can be used to write MCQs for medical examinations, but their limitations cannot be ignored. Two of the included studies were at high risk of bias. Further study in this field is essential, and more conclusive evidence is needed; until then, LLMs may serve as a supplementary tool for writing medical examinations. This review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dabd/10981304/711733e3e869/12909_2024_5239_Fig1_HTML.jpg
