
ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination.

Affiliations

Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.

Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, Ontario, Canada.

Publication Information

Med Teach. 2024 Mar;46(3):366-372. doi: 10.1080/0142159X.2023.2249588. Epub 2023 Oct 15.

DOI: 10.1080/0142159X.2023.2249588
PMID: 37839017
Abstract

PURPOSE

ChatGPT-4 is an upgraded version of an artificial intelligence chatbot. The performance of ChatGPT-4 on the United States Medical Licensing Examination (USMLE) has not been independently characterized. We aimed to assess the performance of ChatGPT-4 at responding to USMLE Step 1, Step 2CK, and Step 3 practice questions.

METHOD

Practice multiple-choice questions for the USMLE Step 1, Step 2CK, and Step 3 were compiled. Of 376 available questions, 319 (85%) were analyzed by ChatGPT-4 on March 21, 2023. Our primary outcome was the performance of ChatGPT-4 for the practice USMLE Step 1, Step 2CK, and Step 3 examinations, measured as the proportion of multiple-choice questions answered correctly. Our secondary outcomes were the mean length of questions and responses provided by ChatGPT-4.

RESULTS

ChatGPT-4 responded to 319 text-based multiple-choice questions from USMLE practice test material. ChatGPT-4 answered 82 of 93 (88%) questions correctly on USMLE Step 1, 91 of 106 (86%) on Step 2CK, and 108 of 120 (90%) on Step 3. ChatGPT-4 provided explanations for all questions. ChatGPT-4 spent 30.8 ± 11.8 s on average responding to practice questions for USMLE Step 1, 23.0 ± 9.4 s per question for Step 2CK, and 23.1 ± 8.3 s per question for Step 3. The mean length of practice USMLE multiple-choice questions that were answered correctly and incorrectly by ChatGPT-4 was similar (difference = 17.48 characters, SE = 59.75, 95% CI = [-100.09, 135.04], t = 0.29, p = 0.77). The mean length of ChatGPT-4's correct responses to practice questions was significantly shorter than the mean length of incorrect responses (difference = 79.58 characters, SE = 35.42, 95% CI = [9.89, 149.28], t = 2.25, p = 0.03).
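The reported figures can be checked arithmetically from the summary values alone, since the test statistic is simply the mean difference divided by its standard error. A minimal Python sketch using only the numbers quoted above (raw per-question data are not public, so only these derived quantities can be verified):

```python
# Sanity checks on the summary statistics reported in the abstract.
# All inputs are the published summary values; no raw data are used.

def t_statistic(difference: float, se: float) -> float:
    """t = mean difference / standard error of the difference."""
    return difference / se

# Accuracy by examination (questions correct / questions attempted)
step1 = 82 / 93        # USMLE Step 1
step2ck = 91 / 106     # USMLE Step 2CK
step3 = 108 / 120      # USMLE Step 3
overall = (82 + 91 + 108) / 319

# Length comparisons, correct vs. incorrect items
t_question_len = t_statistic(17.48, 59.75)  # reported t = 0.29 (p = 0.77)
t_response_len = t_statistic(79.58, 35.42)  # reported t = 2.25 (p = 0.03)

print(f"Step 1: {step1:.0%}  Step 2CK: {step2ck:.0%}  Step 3: {step3:.0%}  overall: {overall:.0%}")
print(f"t (question length) = {t_question_len:.2f}")
print(f"t (response length) = {t_response_len:.2f}")
```

Both computed t values match the abstract's reported statistics to two decimal places, which is consistent with the t values having been derived from the same difference and SE figures.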

CONCLUSIONS

ChatGPT-4 answered a remarkably high proportion of practice questions correctly for USMLE examinations. ChatGPT-4 performed substantially better at USMLE practice questions than previous models of the same AI chatbot.

Similar Articles

1
ChatGPT-4: An assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination.
Med Teach. 2024 Mar;46(3):366-372. doi: 10.1080/0142159X.2023.2249588. Epub 2023 Oct 15.
2
Performance of an Artificial Intelligence Chatbot in Ophthalmic Knowledge Assessment.
JAMA Ophthalmol. 2023 Jun 1;141(6):589-597. doi: 10.1001/jamaophthalmol.2023.1144.
3
Pure Wisdom or Potemkin Villages? A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis.
JMIR Med Educ. 2024 Jan 5;10:e51148. doi: 10.2196/51148.
4
Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study.
JMIR Med Educ. 2024 Jan 18;10:e50842. doi: 10.2196/50842.
5
How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.
6
Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study.
JMIR Med Educ. 2024 Oct 3;10:e52746. doi: 10.2196/52746.
7
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.
J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
8
Assessing ChatGPT 4.0's test performance and clinical diagnostic accuracy on USMLE STEP 2 CK and clinical case reports.
Sci Rep. 2024 Apr 23;14(1):9330. doi: 10.1038/s41598-024-58760-x.
9
Examining ChatGPT Performance on USMLE Sample Items and Implications for Assessment.
Acad Med. 2024 Feb 1;99(2):192-197. doi: 10.1097/ACM.0000000000005549. Epub 2023 Nov 7.
10
Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study.
JMIR Med Educ. 2023 Sep 28;9:e48039. doi: 10.2196/48039.

Cited By

1
AI-Driven Large Language Models in Health Consultations for HIV Patients.
J Multidiscip Healthc. 2025 Aug 25;18:5187-5198. doi: 10.2147/JMDH.S533621. eCollection 2025.
2
Artificial Intelligence and Large Language Models in the Fight Against Superficial Fungal Infections: Friend or Foe?
Clin Cosmet Investig Dermatol. 2025 Aug 20;18:1959-1969. doi: 10.2147/CCID.S522271. eCollection 2025.
3
The performance of ChatGPT on medical image-based assessments and implications for medical education.
BMC Med Educ. 2025 Aug 23;25(1):1192. doi: 10.1186/s12909-025-07752-0.
4
Evaluating the Performance of ChatGPT on Board-Style Examination Questions in Ophthalmology: A Meta-Analysis.
J Med Syst. 2025 Jul 5;49(1):94. doi: 10.1007/s10916-025-02227-7.
5
Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study.
JMIR AI. 2025 Mar 12;4:e67696. doi: 10.2196/67696.
6
Evaluation of the Performance of Large Language Models in the Management of Axial Spondyloarthropathy: Analysis of EULAR 2022 Recommendations.
Diagnostics (Basel). 2025 Jun 7;15(12):1455. doi: 10.3390/diagnostics15121455.
7
The sports nutrition knowledge of large language model (LLM) artificial intelligence (AI) chatbots: An assessment of accuracy, completeness, clarity, quality of evidence, and test-retest reliability.
PLoS One. 2025 Jun 13;20(6):e0325982. doi: 10.1371/journal.pone.0325982. eCollection 2025.
8
Evaluating performance of large language models for atrial fibrillation management using different prompting strategies and languages.
Sci Rep. 2025 May 30;15(1):19028. doi: 10.1038/s41598-025-04309-5.
9
Matching Human Expertise: ChatGPT's Performance on Hand Surgery Examinations.
Hand (N Y). 2025 Mar 20:15589447251322914. doi: 10.1177/15589447251322914.
10
ChatGPT 4.0's efficacy in the self-diagnosis of non-traumatic hand conditions.
J Hand Microsurg. 2025 Jan 23;17(3):100217. doi: 10.1016/j.jham.2025.100217. eCollection 2025 May.