Performance of ChatGPT in the Portuguese National Residency Access Examination.

Author information

Ferraz-Costa Gonçalo, Griné Mafalda, Oliveira-Santos Manuel, Teixeira Rogério

Affiliations

Cardiology Department. Unidade Local de Saúde de Coimbra. Coimbra; Faculdade de Medicina. Universidade de Coimbra. Coimbra; Coimbra Institute for Clinical and Biomedical Research (iCBR). Coimbra. Portugal.

Cardiology Department. Unidade Local de Saúde de Coimbra. Coimbra; Coimbra Institute for Clinical and Biomedical Research (iCBR). Coimbra. Portugal.

Publication information

Acta Med Port. 2025 Mar 3;38(3):170-174. doi: 10.20344/amp.22506. Epub 2024 Dec 20.

DOI:10.20344/amp.22506

PMID:39704711

Abstract

ChatGPT, a language model developed by OpenAI, has been tested in several medical board examinations. This study aims to evaluate the performance of ChatGPT on the Portuguese National Residency Access Examination, a mandatory test for medical residency in Portugal. The study specifically compares the capabilities of ChatGPT versions 3.5 and 4o across five examination editions from 2019 to 2023. A total of 750 multiple-choice questions were submitted to both versions, and their answers were evaluated against the official responses. The findings revealed that ChatGPT 4o significantly outperformed ChatGPT 3.5, with a median examination score of 127 compared to 106 (p = 0.048). Notably, ChatGPT 4o achieved scores within the top 1% in two examination editions and exceeded the median performance of human candidates in all editions. Additionally, ChatGPT 4o's scores were high enough to qualify for any specialty. In conclusion, ChatGPT 4o can be a valuable tool for medical education and decision-making, but human oversight remains essential to ensure safe and accurate clinical practice.