Azrieli Faculty of Medicine, Bar-Ilan University, Ha'Hadas St. 1, Rishon Le Zion, Zefat, 7550598, Israel.
Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel.
BMC Med Educ. 2024 Mar 29;24(1):354. doi: 10.1186/s12909-024-05239-y.
BACKGROUND: Writing multiple choice questions (MCQs) for medical exams is challenging: it demands extensive medical knowledge, time, and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs.

METHODS: The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Non-English studies, studies outside the date range, and studies not focusing on AI-generated multiple-choice questions were excluded. MEDLINE was used as the search database. Risk of bias was evaluated using a tailored QUADAS-2 tool, and the review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

RESULTS: Overall, eight studies published between April 2023 and October 2023 were included. Six studies used ChatGPT-3.5, while two employed GPT-4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models; another compared LLM-generated questions with those written by humans. All studies encountered faulty questions that were deemed inappropriate for medical exams, and some questions required additional modification to qualify. Two studies were at high risk of bias.

CONCLUSIONS: LLMs can be used to write MCQs for medical examinations, but their limitations cannot be ignored. Further study in this field is essential, and more conclusive evidence is needed. Until then, LLMs may serve as a supplementary tool for writing medical examinations.
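To make concrete what LLM-based MCQ generation looks like in practice, below is a minimal Python sketch using the OpenAI chat completions API. It is an illustration only, not the prompting protocol of any study included in the review; the prompt wording, model name, and sampling temperature are assumptions introduced for this example.

```python
# Minimal sketch of generating a medical MCQ with an LLM via the OpenAI API.
# Illustrative only: prompt wording, model choice, and output format are
# assumptions, not the protocol of any study included in the review.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Write one USMLE-style multiple choice question on {topic}. "
    "Provide a clinical vignette stem, five answer options labeled A-E, "
    "the single correct answer, and a one-sentence explanation."
)

def generate_mcq(topic: str, model: str = "gpt-4") -> str:
    """Ask the model for a single MCQ on the given topic."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(topic=topic)}],
        temperature=0.7,  # allow some variety between generated items
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(generate_mcq("community-acquired pneumonia"))
```

Consistent with the review's findings, items produced this way still require expert screening before use: every included study encountered at least some generated questions that were unusable without modification.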