Wu Yao-Cheng, Wu Yun-Chi, Chang Ya-Chuan, Yu Chia-Ying, Wu Chun-Lin, Sung Wen-Wei
School of Medicine, Chung Shan Medical University, Taichung, Taiwan.
Department of Urology, Chung Shan Medical University Hospital, Taichung, Taiwan.
PLoS One. 2025 Jun 4;20(6):e0324841. doi: 10.1371/journal.pone.0324841. eCollection 2025.
Chat Generative Pre-Trained Transformer (ChatGPT), launched by OpenAI in November 2022, features advanced large language models optimized for dialog. However, the performance differences between GPT-3.5, GPT-4, and GPT-4o in medical contexts remain unclear.
This study evaluates the accuracy of GPT-3.5, GPT-4, and GPT-4o across various medical subjects. GPT-4o's performance in Chinese and English was also analyzed.
We retrospectively compared GPT-3.5, GPT-4, and GPT-4o on Stage 1 of the Taiwanese Senior Professional and Technical Examinations for Medical Doctors (SPTEMD) administered from July 2021 to February 2024, excluding image-based questions.
The overall accuracy rates of GPT-3.5, GPT-4, and GPT-4o were 65.74% (781/1188), 95.71% (1137/1188), and 96.72% (1149/1188), respectively. GPT-4 and GPT-4o outperformed GPT-3.5 across all subjects. Statistical analysis revealed a significant difference between GPT-3.5 and the other models (p < 0.05) but no significant difference between GPT-4 and GPT-4o. Among subjects, physiology had a significantly higher error rate (p < 0.05) than the overall average across all three models. GPT-4o's accuracy rates in Chinese (98.14%) and English (98.48%) did not differ significantly.
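The results above report pairwise significance comparisons of overall accuracy, but this excerpt does not name the statistical test used. The following minimal sketch, assuming a chi-square test of independence on the reported correct/incorrect counts (Python with SciPy), illustrates how such pairwise comparisons could be reproduced from the figures given; it is not the authors' analysis code.

```python
# Illustrative sketch (assumption: chi-square test of independence on the
# correct/incorrect counts reported in the abstract; the original test is not
# specified in this excerpt).
from scipy.stats import chi2_contingency

TOTAL = 1188  # non-image questions answered by each model
correct = {"GPT-3.5": 781, "GPT-4": 1137, "GPT-4o": 1149}

for a, b in [("GPT-3.5", "GPT-4"), ("GPT-3.5", "GPT-4o"), ("GPT-4", "GPT-4o")]:
    # 2x2 contingency table: rows = models, columns = correct vs. incorrect
    table = [
        [correct[a], TOTAL - correct[a]],
        [correct[b], TOTAL - correct[b]],
    ]
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{a} vs {b}: chi2 = {chi2:.2f}, p = {p:.3g}")
```

Run on the reported counts, this reproduces the pattern described: GPT-3.5 differs significantly from both newer models, while GPT-4 and GPT-4o do not differ significantly from each other.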
GPT-4 and GPT-4o exceed the accuracy threshold for the Taiwanese SPTEMD, demonstrating advances in contextual comprehension and reasoning. Future research should focus on their responsible integration into medical training and assessment.