

Large language models for generating medical examinations: systematic review.

Affiliations

Azrieli Faculty of Medicine, Bar-Ilan University, Ha'Hadas St. 1, Rishon Le Zion, Zefat, 7550598, Israel.

Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel.

Publication information

BMC Med Educ. 2024 Mar 29;24(1):354. doi: 10.1186/s12909-024-05239-y.

DOI: 10.1186/s12909-024-05239-y
PMID: 38553693
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10981304/
Abstract

BACKGROUND: Writing multiple-choice questions (MCQs) for medical exams is challenging: it demands extensive medical knowledge, time, and effort from medical educators. This systematic review focuses on the application of large language models (LLMs) in generating medical MCQs.

METHODS: The authors searched for studies published up to November 2023. Search terms focused on LLM-generated MCQs for medical examinations. Non-English studies, studies outside the year range, and studies not focusing on AI-generated multiple-choice questions were excluded. MEDLINE was used as the search database. Risk of bias was evaluated using a tailored QUADAS-2 tool. The review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

RESULTS: Overall, eight studies published between April 2023 and October 2023 were included. Six studies used Chat-GPT 3.5, while two employed GPT-4. Five studies showed that LLMs can produce competent questions valid for medical exams. Three studies used LLMs to write medical questions but did not evaluate the validity of the questions. One study conducted a comparative analysis of different models, and another compared LLM-generated questions with those written by humans. All studies presented faulty questions that were deemed inappropriate for medical exams, and some questions required additional modification in order to qualify. Two studies were at high risk of bias.

CONCLUSIONS: LLMs can be used to write MCQs for medical examinations, but their limitations cannot be ignored. Further study in this field is essential, and more conclusive evidence is needed. Until then, LLMs may serve as a supplementary tool for writing medical examinations.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dabd/10981304/711733e3e869/12909_2024_5239_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dabd/10981304/57a3669c10fb/12909_2024_5239_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/dabd/10981304/fe743b0607ba/12909_2024_5239_Fig3_HTML.jpg

Similar articles

[1]
Large language models for generating medical examinations: systematic review.

BMC Med Educ. 2024-3-29

[2]
Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study.

JMIR Med Educ. 2024-10-3

[3]
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.

J Med Internet Res. 2024-7-25

[4]
Twelve tips to leverage AI for efficient and effective medical question generation: A guide for educators using Chat GPT.

Med Teach. 2024-8

[5]
Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022-2-1

[6]
The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review.

JMIR Med Inform. 2024-5-10

[7]
A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare.

medRxiv. 2024-4-27

[8]
Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4.

BMC Med Educ. 2023-10-17

[9]
Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study.

JMIR Med Educ. 2024-2-8

[10]
Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany.

JMIR Med Educ. 2023-9-4

Cited by

[1]
Assessing LLM-generated vs. expert-created clinical anatomy MCQs: a student perception-based comparative study in medical education.

Med Educ Online. 2025-12

[2]
Evaluating AI-generated examination papers in periodontology: a comparative study with human-designed counterparts.

BMC Med Educ. 2025-7-23

[3]
OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board-Style Questions.

Ophthalmol Sci. 2025-6-6

[4]
Situating governance and regulatory concerns for generative artificial intelligence and large language models in medical education.

NPJ Digit Med. 2025-5-27

[5]
Can ChatGPT-4o Really Pass Medical Science Exams? A Pragmatic Analysis Using Novel Questions.

Med Sci Educ. 2025-2-4

[6]
Performance of single-agent and multi-agent language models in Spanish language medical competency exams.

BMC Med Educ. 2025-5-7

[7]
GPT-4's capabilities for formative and summative assessments in Norwegian medicine exams-an intrinsic case study in the early phase of intervention.

Front Med (Lausanne). 2025-4-10

[8]
Delving into the Practical Applications and Pitfalls of Large Language Models in Medical Education: Narrative Review.

Adv Med Educ Pract. 2025-4-18

[9]
Generative AI vs. human expertise: a comparative analysis of case-based rational pharmacotherapy question generation.

Eur J Clin Pharmacol. 2025-6

[10]
Application of large language models in healthcare: A bibliometric analysis.

Digit Health. 2025-3-2

References

[1]
Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions.

Acad Med. 2024-5-1

[2]
Constructing "Burnout": A Critical Discourse Analysis of Burnout in Postgraduate Medical Education.

Acad Med. 2023-11-1

[3]
Automated Patient Note Grading: Examining Scoring Reliability and Feasibility.

Acad Med. 2023-11-1

[4]
Teaching AI Ethics in Medical Education: A Scoping Review of Current Literature and Practices.

Perspect Med Educ. 2023

[5]
An explorative assessment of ChatGPT as an aid in medical education: Use it with caution.

Med Teach. 2024-5

[6]
Advantages and pitfalls in utilizing artificial intelligence for crafting medical examinations: a medical education pilot study with GPT-4.

BMC Med Educ. 2023-10-17

[7]
The future landscape of large language models in medicine.

Commun Med (Lond). 2023-10-10

[8]
Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial.

J Med Internet Res. 2023-10-4

[9]
Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments.

Sci Rep. 2023-10-1

[10]
ChatGPT versus human in generating medical graduate exam multiple choice questions-A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom).

PLoS One. 2023
