Department of Health Communication, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.
Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan.
J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. Across versions and testing environments, ChatGPT's performance on these examinations has varied remarkably, and a comprehensive understanding of this variability is still lacking.
In this study, we reviewed all studies on ChatGPT's performance in medical licensing examinations published up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by comprehensively analyzing ChatGPT's performance across examination settings. The insights gained from this systematic review will help educators, policymakers, and technical experts use AI in medical education effectively and judiciously.
We searched Web of Science, PubMed, and Scopus with predefined query strings for literature published between January 1, 2022, and March 29, 2024. Two authors independently screened the literature against the inclusion and exclusion criteria, extracted data, and assessed study quality using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool. We conducted both qualitative and quantitative analyses.
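The abstract does not specify the pooling model used in the quantitative analysis; a common choice for pooling accuracy rates across studies is a DerSimonian-Laird random-effects model on logit-transformed proportions. A minimal sketch under that assumption (the study counts below are illustrative placeholders, not data from this review):

```python
# Sketch: DerSimonian-Laird random-effects pooling of accuracy rates on the
# logit scale. Study data are illustrative placeholders, not the review's data.
import math

studies = [(180, 220), (95, 140), (310, 400)]  # (correct answers, total questions)

ys, vs = [], []
for k, n in studies:
    p = k / n
    ys.append(math.log(p / (1 - p)))   # logit-transformed accuracy
    vs.append(1 / (n * p * (1 - p)))   # within-study variance on the logit scale

# Fixed-effect weights and Cochran's Q statistic for heterogeneity
w = [1 / v for v in vs]
ybar = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, ys))

# DerSimonian-Laird estimate of between-study variance (tau^2)
tau2 = max(0.0, (q - (len(studies) - 1)) /
           (sum(w) - sum(wi ** 2 for wi in w) / sum(w)))

# Random-effects pooled estimate and 95% CI, back-transformed to a proportion
ws = [1 / (v + tau2) for v in vs]
mu = sum(wi * yi for wi, yi in zip(ws, ys)) / sum(ws)
se = math.sqrt(1 / sum(ws))
inv = lambda x: 1 / (1 + math.exp(-x))  # inverse logit
print(f"pooled accuracy {inv(mu):.2f} "
      f"(95% CI {inv(mu - 1.96 * se):.2f}-{inv(mu + 1.96 * se):.2f})")
```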
A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases and outperformed the average scores of medical students in 13 of 17 cases. Translating examination questions into English improved GPT-3.5's performance but did not affect GPT-4's. GPT-3.5 showed no performance difference between examinations from English-speaking and non-English-speaking countries (P=.72), whereas GPT-4 performed significantly better on examinations from English-speaking countries (P=.02). Prompts of any type significantly improved the performance of GPT-3.5 (P=.03) and GPT-4 (P<.01). GPT-3.5 performed better on short-text questions than on long-text questions, and question difficulty affected the performance of both GPT-3.5 and GPT-4. On image-based multiple-choice questions (MCQs), ChatGPT's accuracy rate ranged from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs.
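The abstract does not state how the subgroup P values (eg, P=.72 and P=.02 for English-speaking vs non-English-speaking examinations) were derived; one conventional approach is a two-sided z-test between the pooled logit estimates of the two subgroups. A minimal sketch under that assumption, with illustrative (not reported) inputs:

```python
# Sketch: two-sided z-test between two pooled logit estimates (eg,
# English-speaking vs non-English-speaking examinations). mu/se pairs would
# come from pooling each subgroup as above; values here are illustrative.
import math

mu1, se1 = 1.45, 0.10  # pooled logit accuracy, subgroup 1 (assumed)
mu2, se2 = 1.10, 0.11  # pooled logit accuracy, subgroup 2 (assumed)

z = (mu1 - mu2) / math.sqrt(se1 ** 2 + se2 ** 2)
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided P value
print(f"z = {z:.2f}, P = {p:.3f}")
```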
GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education.
PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687.