
A multi-model longitudinal assessment of ChatGPT performance on medical residency examinations.

Authors

Souto Maria Eduarda Varela Cavalcanti, Fernandes Alexandre Chaves, Silva Ana Beatriz Santana, de Freitas Ribeiro Louise Helena, de Medeiros Fernandes Thales Allyrio Araújo

Affiliations

Department of Biomedical Sciences, School of Health Sciences, State University of Rio Grande do Norte, Mossoró, Brazil.

Institute of Mathematics and Computer Sciences, University of São Paulo, São Paulo, Brazil.

Publication

Front Artif Intell. 2025 Aug 22;8:1614874. doi: 10.3389/frai.2025.1614874. eCollection 2025.

DOI: 10.3389/frai.2025.1614874
PMID: 40918587
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12411524/
Abstract

INTRODUCTION

ChatGPT, a generative artificial intelligence, has potential applications in numerous fields, including medical education. This potential can be assessed through its performance on medical exams. Medical residency exams, critical for entering medical specialties, serve as a valuable benchmark.

MATERIALS AND METHODS

This study aimed to assess the accuracy of ChatGPT-4 and GPT-4o in responding to 1,041 medical residency questions from Brazil, examining overall accuracy and performance across different medical areas, based on evaluations conducted in 2023 and 2024. The questions were classified into higher and lower cognitive levels according to Bloom's taxonomy. Additionally, questions answered incorrectly by both models were tested using the recent GPT models that use chain-of-thought reasoning (e.g., o1-preview, o3, o4-mini-high) with evaluations carried out in both 2024 and 2025.

RESULTS

GPT-4 achieved 81.27% accuracy (95% CI: 78.89-83.64%), while GPT-4o reached 85.88% (95% CI: 83.76-88.00%), significantly outperforming GPT-4 (p < 0.05). Both models showed reduced accuracy on higher-order thinking questions. On questions that both models failed, GPT o1-preview achieved 53.26% accuracy (95% CI: 42.87-63.65%), GPT o3 47.83% (95% CI: 37.42-58.23%), and o4-mini-high 35.87% (95% CI: 25.88-45.86%), with all three models performing better on higher-order questions.
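The reported intervals are consistent with a simple normal-approximation (Wald) confidence interval for a binomial proportion. A minimal sketch, assuming the Wald method and back-calculating roughly 846 correct answers for GPT-4 out of 1,041 (the abstract states neither the raw counts nor the exact CI method):

```python
import math

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided normal-approximation (Wald) CI for a proportion.

    z = 1.96 corresponds to a 95% confidence level.
    """
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# GPT-4: 81.27% accuracy on 1,041 questions -> ~846 correct (back-calculated)
lo, hi = wald_ci(846, 1041)
print(f"{lo:.2%} - {hi:.2%}")
```

Running this yields an interval very close to the reported 78.89-83.64%; small discrepancies in the last digit may come from rounding or from a different interval method (e.g. Wilson).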

CONCLUSION

Artificial intelligence could be a beneficial tool in medical education, enhancing residency exam preparation, helping to understand complex topics, and improving teaching strategies. However, careful use of artificial intelligence is essential due to ethical concerns and potential limitations in both educational and clinical practice.


