Can ChatGPT-4o Really Pass Medical Science Exams? A Pragmatic Analysis Using Novel Questions.

Authors

Newton Philip M, Summers Christopher J, Zaheer Uzman, Xiromeriti Maira, Stokes Jemima R, Bhangu Jaskaran Singh, Roome Elis G, Roberts-Phillips Alanna, Mazaheri-Asadi Darius, Jones Cameron D, Hughes Stuart, Gilbert Dominic, Jones Ewan, Essex Keioni, Ellis Emily C, Davey Ross, Cox Adrienne A, Bassett Jessica A

Affiliation

Swansea University Medical School, Swansea, Wales, SA2 8PP UK.

Publication

Med Sci Educ. 2025 Feb 4;35(2):721-729. doi: 10.1007/s40670-025-02293-z. eCollection 2025 Apr.

DOI:10.1007/s40670-025-02293-z
PMID:40352979
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12058600/
Abstract

ChatGPT apparently shows excellent performance on high-level professional exams such as those involved in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. However, ChatGPT has previously shown weaker performance on questions with pictures, and there have been concerns that ChatGPT's performance may be artificially inflated by the public nature of the sample questions tested, meaning they likely formed part of the training materials for ChatGPT. This led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and making extensive use of picture-based questions. These approaches remain untested. Here, we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams. ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Exam Applied Knowledge Test and 89.9% on the United States Medical Licensing Exam Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions which were not based on any existing questions. ChatGPT did show reduced performance on questions containing images when the answer options were added to an image as text labels. These data demonstrate that the performance of ChatGPT continues to improve and that secure testing environments are required for the valid assessment of both foundational and higher order learning.

SUPPLEMENTARY INFORMATION

The online version contains supplementary material available at 10.1007/s40670-025-02293-z.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f513/12058600/6a5232d84962/40670_2025_2293_Fig1_HTML.jpg
