Suppr 超能文献


Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis.

Affiliations

Department of Plastic, Hand and Reconstructive Surgery, University Hospital Regensburg, Regensburg, Germany.

Division of Hand, Plastic and Aesthetic Surgery, Ludwig-Maximilians University Munich, Munich, Germany.

Publication

JMIR Med Educ. 2024 Jan 5;10:e51148. doi: 10.2196/51148.

DOI: 10.2196/51148
PMID: 38180782
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10799278/
Abstract

BACKGROUND

The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student's knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT's performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited.

OBJECTIVE

This paper aimed to analyze ChatGPT's performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating.

METHODS

A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After excluding 229 image-based questions, the remaining 1840 text-based questions were categorized and entered into ChatGPT 3.5, while a subset of 229 questions was entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT's answers, as well as its performance across test question categories and difficulty levels, was compared between the two versions.
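The question counts in the Methods reconcile as follows (a minimal sketch using only the figures reported in the abstract; the stated reason for excluding image items is our assumption, not a quote from the paper):

```python
# Question counts as reported in the Methods section
total_questions = 2069  # extracted from the AMBOSS study platform
image_based = 229       # excluded (assumption: image input is unsupported in the chat interface)

text_based = total_questions - image_based
print(text_based)       # 1840 text-based questions entered into ChatGPT 3.5

gpt4_subset = 229       # subset of the text-based items also entered into ChatGPT 4
```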

RESULTS

Overall, ChatGPT 4 demonstrated a statistically significant superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) and 56.9% (1047/1840), respectively. A noteworthy correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (ρ=-0.069; P=.003), which was absent in ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with ρ=-0.289 for ChatGPT 3.5 and ρ=-0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 in all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached.
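The headline accuracies can be reproduced from the reported counts, and the size of the gap illustrated with a standard two-proportion z-test (a sketch; the z-test here is our illustrative choice, not necessarily the significance test used in the paper):

```python
from math import sqrt

# Correct/total counts reported in the Results section
correct_4, n_4 = 194, 229      # ChatGPT 4
correct_35, n_35 = 1047, 1840  # ChatGPT 3.5

acc_4 = correct_4 / n_4        # ~0.847
acc_35 = correct_35 / n_35     # ~0.569

# Two-proportion z-test on the accuracy difference (illustrative choice of test)
p_pool = (correct_4 + correct_35) / (n_4 + n_35)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_4 + 1 / n_35))
z = (acc_4 - acc_35) / se

print(f"ChatGPT 4: {acc_4:.1%}  ChatGPT 3.5: {acc_35:.1%}  z = {z:.2f}")
```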

CONCLUSIONS

In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3c77/10799278/bc1e5b63066b/mededu_v10i1e51148_fig3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3c77/10799278/9a6f2d58abc7/mededu_v10i1e51148_fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3c77/10799278/aa54f64c6d22/mededu_v10i1e51148_fig2.jpg

Similar articles

1
Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis.
JMIR Med Educ. 2024 Jan 5;10:e51148. doi: 10.2196/51148.
2
In-depth analysis of ChatGPT's performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions.
Sci Rep. 2024 Jun 12;14(1):13553. doi: 10.1038/s41598-024-63997-7.
3
How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.
4
Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study.
JMIR Med Educ. 2024 Jan 18;10:e50842. doi: 10.2196/50842.
5
Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments.
Sci Rep. 2023 Oct 1;13(1):16492. doi: 10.1038/s41598-023-43436-9.
6
Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study.
JMIR Med Educ. 2024 Apr 29;10:e55048. doi: 10.2196/55048.
7
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.
J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
8
ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination.
Med Teach. 2024 Mar;46(3):366-372. doi: 10.1080/0142159X.2023.2249588. Epub 2023 Oct 15.
9
Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study.
JMIR Med Educ. 2024 Feb 9;10:e48514. doi: 10.2196/48514.
10
ChatGPT's performance in German OB/GYN exams - paving the way for AI-enhanced medical education and clinical practice.
Front Med (Lausanne). 2023 Dec 13;10:1296615. doi: 10.3389/fmed.2023.1296615. eCollection 2023.

Cited by

1
Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis.
J Med Internet Res. 2025 Apr 30;27:e64486. doi: 10.2196/64486.
2
Analyzing Question Characteristics Influencing ChatGPT's Performance in 3000 USMLE®-Style Questions.
Med Sci Educ. 2024 Sep 28;35(1):257-267. doi: 10.1007/s40670-024-02176-9. eCollection 2025 Feb.
3
Applications of Artificial Intelligence in Medical Education: A Systematic Review.

References

1
Surgeon or Bot? The Risks of Using Artificial Intelligence in Surgical Journal Publications.
Ann Surg Open. 2023 Jun 28;4(3):e309. doi: 10.1097/AS9.0000000000000309. eCollection 2023 Sep.
2
ChatGPT's quiz skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-choice board certification preparation questions.
Eur Arch Otorhinolaryngol. 2023 Sep;280(9):4271-4278. doi: 10.1007/s00405-023-08051-4. Epub 2023 Jun 7.
3
Applications of Artificial Intelligence in Medical Education: A Systematic Review.
Cureus. 2025 Mar 1;17(3):e79878. doi: 10.7759/cureus.79878. eCollection 2025 Mar.
4
Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study.
J Med Internet Res. 2025 Feb 5;27:e63626. doi: 10.2196/63626.
5
Advancements in AI Medical Education: Assessing ChatGPT's Performance on USMLE-Style Questions Across Topics and Difficulty Levels.
Cureus. 2024 Dec 24;16(12):e76309. doi: 10.7759/cureus.76309. eCollection 2024 Dec.
6
Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis.
J Med Internet Res. 2024 Dec 27;26:e66114. doi: 10.2196/66114.
7
Will Artificial Intelligence Replace the Medical Toxicologist: Pediatric Referral Thresholds Generated by GPT-4.
J Med Toxicol. 2025 Jan;21(1):85-88. doi: 10.1007/s13181-024-01050-9. Epub 2024 Dec 16.
8
ChatG-PD? Comparing large language model artificial intelligence and faculty rankings of the competitiveness of standardized letters of evaluation.
AEM Educ Train. 2024 Dec 9;8(6):e11052. doi: 10.1002/aet2.11052. eCollection 2024 Dec.
9
Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: Comparison Study.
JMIR Med Educ. 2024 Dec 4;10:e57451. doi: 10.2196/57451.
10
Large language model doctor: assessing the ability of ChatGPT-4 to deliver interventional radiology procedural information to patients during the consent process.
CVIR Endovasc. 2024 Nov 29;7(1):83. doi: 10.1186/s42155-024-00477-z.
The Potential of ChatGPT in Medical Education: Focusing on USMLE Preparation.
Ann Biomed Eng. 2023 Oct;51(10):2123-2124. doi: 10.1007/s10439-023-03253-7. Epub 2023 May 29.
4
Artificial intelligence-enabled simulation of gluteal augmentation: A helpful tool in preoperative outcome simulation?
J Plast Reconstr Aesthet Surg. 2023 May;80:94-101. doi: 10.1016/j.bjps.2023.01.039. Epub 2023 Feb 9.
5
ChatGPT versus the neurosurgical written boards: a comparative analysis of artificial intelligence/machine learning performance on neurosurgical board-style questions.
J Neurosurg. 2023 Mar 24;139(3):904-911. doi: 10.3171/2023.2.JNS23419.
6
The potential impact of ChatGPT in clinical and translational medicine.
Clin Transl Med. 2023 Mar;13(3):e1206. doi: 10.1002/ctm2.1206.
7
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models.
PLOS Digit Health. 2023 Feb 9;2(2):e0000198. doi: 10.1371/journal.pdig.0000198. eCollection 2023 Feb.
8
ChatGPT passing USMLE shines a spotlight on the flaws of medical education.
PLOS Digit Health. 2023 Feb 9;2(2):e0000205. doi: 10.1371/journal.pdig.0000205. eCollection 2023 Feb.
9
How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.
10
Artificial Intelligence-Enabled Evaluation of Pain Sketches to Predict Outcomes in Headache Surgery.
Plast Reconstr Surg. 2023 Feb 1;151(2):405-411. doi: 10.1097/PRS.0000000000009855. Epub 2022 Nov 15.