
A multi-model longitudinal assessment of ChatGPT performance on medical residency examinations.

Authors

Souto Maria Eduarda Varela Cavalcanti, Fernandes Alexandre Chaves, Silva Ana Beatriz Santana, de Freitas Ribeiro Louise Helena, de Medeiros Fernandes Thales Allyrio Araújo

Affiliations

Department of Biomedical Sciences, School of Health Sciences, State University of Rio Grande do Norte, Mossoró, Brazil.

Institute of Mathematics and Computer Sciences, University of São Paulo, São Paulo, Brazil.

Publication

Front Artif Intell. 2025 Aug 22;8:1614874. doi: 10.3389/frai.2025.1614874. eCollection 2025.

DOI: 10.3389/frai.2025.1614874
PMID: 40918587
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12411524/
Abstract

INTRODUCTION

ChatGPT, a generative artificial intelligence, has potential applications in numerous fields, including medical education. This potential can be assessed through its performance on medical exams. Medical residency exams, critical for entering medical specialties, serve as a valuable benchmark.

MATERIALS AND METHODS

This study aimed to assess the accuracy of ChatGPT-4 and GPT-4o in responding to 1,041 medical residency questions from Brazil, examining overall accuracy and performance across different medical areas, based on evaluations conducted in 2023 and 2024. The questions were classified into higher and lower cognitive levels according to Bloom's taxonomy. Additionally, questions answered incorrectly by both models were tested using the recent GPT models that use chain-of-thought reasoning (e.g., o1-preview, o3, o4-mini-high) with evaluations carried out in both 2024 and 2025.

RESULTS

GPT-4 achieved 81.27% accuracy (95% CI: 78.89-83.64%), while GPT-4o reached 85.88% (95% CI: 83.76-88.00%), significantly outperforming GPT-4 (p < 0.05). Both models showed reduced accuracy on higher-order thinking questions. On questions that both models failed, GPT o1-preview achieved 53.26% accuracy (95% CI: 42.87-63.65%), GPT o3 47.83% (95% CI: 37.42-58.23%), and o4-mini-high 35.87% (95% CI: 25.88-45.86%), with all three models performing better on higher-order questions.
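The reported intervals are consistent with a simple normal-approximation (Wald) confidence interval for a binomial proportion. A minimal sketch, assuming the Wald method and back-calculating roughly 846 correct answers for GPT-4 out of 1,041 (the abstract states neither the raw counts nor the exact CI method):

```python
import math

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided normal-approximation (Wald) CI for a proportion.

    z = 1.96 corresponds to a 95% confidence level.
    """
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# GPT-4: 81.27% accuracy on 1,041 questions -> ~846 correct (back-calculated)
lo, hi = wald_ci(846, 1041)
print(f"{lo:.2%} - {hi:.2%}")
```

Running this yields an interval very close to the reported 78.89-83.64%; small discrepancies in the last digit may come from rounding or from a different interval method (e.g. Wilson).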

CONCLUSION

Artificial intelligence could be a beneficial tool in medical education, enhancing residency exam preparation, helping to understand complex topics, and improving teaching strategies. However, careful use of artificial intelligence is essential due to ethical concerns and potential limitations in both educational and clinical practice.


