Wu Yao-Cheng, Wu Yun-Chi, Chang Ya-Chuan, Yu Chia-Ying, Wu Chun-Lin, Sung Wen-Wei
School of Medicine, Chung Shan Medical University, Taichung, Taiwan.
Department of Urology, Chung Shan Medical University Hospital, Taichung, Taiwan.
PLoS One. 2025 Jun 4;20(6):e0324841. doi: 10.1371/journal.pone.0324841. eCollection 2025.
Chat Generative Pre-Trained Transformer (ChatGPT), launched by OpenAI in November 2022, features advanced large language models optimized for dialog. However, the performance differences between GPT-3.5, GPT-4, and GPT-4o in medical contexts remain unclear.
This study evaluates the accuracy of GPT-3.5, GPT-4, and GPT-4o across various medical subjects. GPT-4o's performance in Chinese and English was also analyzed.
We retrospectively compared GPT-3.5, GPT-4, and GPT-4o on Stage 1 of the Taiwanese Senior Professional and Technical Examinations for Medical Doctors (SPTEMD) administered from July 2021 to February 2024, excluding image-based questions.
The overall accuracy rates of GPT-3.5, GPT-4, and GPT-4o were 65.74% (781/1188), 95.71% (1137/1188), and 96.72% (1149/1188), respectively. GPT-4 and GPT-4o outperformed GPT-3.5 across all subjects. Statistical analysis revealed a significant difference between GPT-3.5 and the other models (p < 0.05) but no significant difference between GPT-4 and GPT-4o. Among subjects, physiology had a significantly higher error rate (p < 0.05) than the overall average across all three models. GPT-4o's accuracy rates in Chinese (98.14%) and English (98.48%) did not differ significantly.
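The results above report pairwise significance comparisons of overall accuracy, but this excerpt does not name the statistical test used. The following minimal sketch, assuming a chi-square test of independence on the reported correct/incorrect counts (Python with SciPy), illustrates how such pairwise comparisons could be reproduced from the figures given; it is not the authors' analysis code.

```python
# Illustrative sketch (assumption: chi-square test of independence on the
# correct/incorrect counts reported in the abstract; the original test is not
# specified in this excerpt).
from scipy.stats import chi2_contingency

TOTAL = 1188  # non-image questions answered by each model
correct = {"GPT-3.5": 781, "GPT-4": 1137, "GPT-4o": 1149}

for a, b in [("GPT-3.5", "GPT-4"), ("GPT-3.5", "GPT-4o"), ("GPT-4", "GPT-4o")]:
    # 2x2 contingency table: rows = models, columns = correct vs. incorrect
    table = [
        [correct[a], TOTAL - correct[a]],
        [correct[b], TOTAL - correct[b]],
    ]
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{a} vs {b}: chi2 = {chi2:.2f}, p = {p:.3g}")
```

Run on the reported counts, this reproduces the pattern described: GPT-3.5 differs significantly from both newer models, while GPT-4 and GPT-4o do not differ significantly from each other.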
GPT-4 and GPT-4o exceed the accuracy threshold for the Taiwanese SPTEMD, demonstrating advances in contextual comprehension and reasoning. Future research should focus on their responsible integration into medical training and assessment.