

Advancing medical AI: GPT-4 and GPT-4o surpass GPT-3.5 in Taiwanese medical licensing exams.

Author Information

Wu Yao-Cheng, Wu Yun-Chi, Chang Ya-Chuan, Yu Chia-Ying, Wu Chun-Lin, Sung Wen-Wei

Affiliations

School of Medicine, Chung Shan Medical University, Taichung, Taiwan.

Department of Urology, Chung Shan Medical University Hospital, Taichung, Taiwan.

Publication Information

PLoS One. 2025 Jun 4;20(6):e0324841. doi: 10.1371/journal.pone.0324841. eCollection 2025.

Abstract

BACKGROUND

Chat Generative Pre-Trained Transformer (ChatGPT), launched by OpenAI in November 2022, features advanced large language models optimized for dialog. However, the performance differences between GPT-3.5, GPT-4, and GPT-4o in medical contexts remain unclear.

OBJECTIVE

This study evaluates the accuracy of GPT-3.5, GPT-4, and GPT-4o across various medical subjects. GPT-4o's performance in Chinese and English was also analyzed.

METHODS

We retrospectively compared GPT-3.5, GPT-4, and GPT-4o in Stage 1 of the Taiwanese Senior Professional and Technical Examinations for Medical Doctors (SPTEMD) from July 2021 to February 2024, excluding image-based questions.

RESULTS

The overall accuracy rates of GPT-3.5, GPT-4, and GPT-4o were 65.74% (781/1188), 95.71% (1137/1188), and 96.72% (1149/1188), respectively. GPT-4 and GPT-4o outperformed GPT-3.5 across all subjects. Statistical analysis revealed a significant difference between GPT-3.5 and the other models (p < 0.05) but no significant difference between GPT-4 and GPT-4o. Among subjects, physiology had a significantly higher error rate (p < 0.05) than the overall average across all three models. GPT-4o's accuracy rates in Chinese (98.14%) and English (98.48%) did not differ significantly.
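The pairwise comparisons can be illustrated directly from the reported counts. The sketch below is a minimal illustration that assumes a chi-square test of independence on correct/incorrect counts for each pair of models; the abstract does not name the specific statistical test used, so this is an assumption for demonstration only, with the totals taken from the figures reported above.

```python
# Minimal sketch: pairwise comparisons on the reported accuracy counts.
# Assumption: a 2x2 chi-square test of independence (the abstract does not
# specify the exact test the authors used).
from scipy.stats import chi2_contingency

TOTAL = 1188  # non-image questions analyzed, per the abstract
correct = {"GPT-3.5": 781, "GPT-4": 1137, "GPT-4o": 1149}

def compare(model_a: str, model_b: str) -> float:
    """Return the p-value of a 2x2 chi-square test comparing two models."""
    table = [
        [correct[model_a], TOTAL - correct[model_a]],  # correct / incorrect
        [correct[model_b], TOTAL - correct[model_b]],
    ]
    _, p, _, _ = chi2_contingency(table)
    return p

for a, b in [("GPT-3.5", "GPT-4"), ("GPT-3.5", "GPT-4o"), ("GPT-4", "GPT-4o")]:
    print(f"{a} vs {b}: p = {compare(a, b):.4g}")
```

Run this way, the comparisons involving GPT-3.5 come out far below 0.05 while GPT-4 vs GPT-4o does not, which is consistent with the pattern of significance described above, though the authors' exact test statistics may differ.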

CONCLUSIONS

GPT-4 and GPT-4o exceed the accuracy threshold for Taiwanese SPTEMD, demonstrating advancements in contextual comprehension and reasoning. Future research should focus on responsible integration into medical training and assessment.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4664/12136359/dc9370643300/pone.0324841.g001.jpg
