Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study.

Affiliations

Graduate School of Education, Stanford University, Stanford, CA, United States.

School of Medicine, Universidad de Chile, Santiago, Chile.

Publication information

JMIR Med Educ. 2024 Apr 29;10:e55048. doi: 10.2196/55048.

DOI:10.2196/55048

PMID:38686550

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11082432/

Abstract

BACKGROUND

The deployment of OpenAI's ChatGPT-3.5 and its subsequent versions, ChatGPT-4 and ChatGPT-4 With Vision (4V; also known as "GPT-4 Turbo With Vision"), has notably influenced the medical field. Having demonstrated remarkable performance in medical examinations globally, these models show potential for educational applications. However, their effectiveness in non-English contexts is less explored, particularly in Chile's medical licensing examination, a critical step for medical practitioners in Chile. This gap highlights the need to evaluate ChatGPT's adaptability to diverse linguistic and cultural contexts.

OBJECTIVE

This study aims to evaluate the performance of ChatGPT versions 3.5, 4, and 4V in the EUNACOM (Examen Único Nacional de Conocimientos de Medicina), a major medical examination in Chile.

METHODS

Three official practice drills (540 questions) from the University of Chile, mirroring the EUNACOM's structure and difficulty, were used to test ChatGPT versions 3.5, 4, and 4V. Each version was given 3 attempts at each drill. Responses to the questions in each attempt were systematically categorized and analyzed to assess each version's accuracy rate.
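As an illustration only, the accuracy rate described above can be sketched as the percentage of correct answers in an attempt, averaged over repeated attempts. The abstract does not state how the authors aggregated attempts, so the averaging step, the function names (`accuracy_rate`, `mean_accuracy`), and the toy data below are assumptions, not the study's method or data.

```python
from statistics import mean


def accuracy_rate(graded):
    """Return the percentage of correct answers in one attempt.

    `graded` is a list of booleans, one per question (True = correct).
    """
    if not graded:
        raise ValueError("attempt has no graded answers")
    return 100.0 * sum(graded) / len(graded)


def mean_accuracy(attempts):
    """Average the per-attempt accuracy rates (an assumed aggregation)."""
    return mean(accuracy_rate(attempt) for attempt in attempts)


# Hypothetical toy data: 3 attempts at a 5-question drill (NOT study data).
toy_attempts = [
    [True, True, False, True, False],   # 3 of 5 correct
    [True, True, True, True, False],    # 4 of 5 correct
    [True, False, False, True, False],  # 2 of 5 correct
]
toy_mean = mean_accuracy(toy_attempts)
```

With real data, each inner list would hold the graded responses of one model version for one attempt at one drill.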

RESULTS

All versions of ChatGPT passed the EUNACOM drills. Specifically, versions 4 and 4V outperformed version 3.5, achieving average accuracy rates of 79.32% and 78.83%, respectively, compared to 57.53% for version 3.5 (P<.001). Version 4V, however, did not outperform version 4 (P=.73), despite the additional visual capabilities. We also evaluated ChatGPT's performance in different medical areas of the EUNACOM and found that versions 4 and 4V consistently outperformed version 3.5. Across the different medical areas, version 3.5 displayed the highest accuracy in psychiatry (69.84%), while versions 4 and 4V achieved the highest accuracy in surgery (90.00% and 86.11%, respectively). Versions 3.5 and 4 had the lowest performance in internal medicine (52.74% and 75.62%, respectively), while version 4V had the lowest performance in public health (74.07%).

CONCLUSIONS

This study reveals ChatGPT's ability to pass the EUNACOM, with distinct proficiencies across versions 3.5, 4, and 4V. Notably, advancements in artificial intelligence (AI) have not led to significant improvements in performance on image-based questions. The variations in proficiency across medical fields suggest the need for more nuanced AI training. Additionally, the study underscores the importance of exploring innovative approaches to using AI to augment human cognition and enhance the learning process. Such advancements have the potential to significantly influence medical education, fostering not only knowledge acquisition but also the development of critical thinking and problem-solving skills among health care professionals.
