


Advancing medical AI: GPT-4 and GPT-4o surpass GPT-3.5 in Taiwanese medical licensing exams.

Authors

Wu Yao-Cheng, Wu Yun-Chi, Chang Ya-Chuan, Yu Chia-Ying, Wu Chun-Lin, Sung Wen-Wei

Affiliations

School of Medicine, Chung Shan Medical University, Taichung, Taiwan.

Department of Urology, Chung Shan Medical University Hospital, Taichung, Taiwan.

Publication

PLoS One. 2025 Jun 4;20(6):e0324841. doi: 10.1371/journal.pone.0324841. eCollection 2025.

DOI: 10.1371/journal.pone.0324841
PMID: 40465748
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12136359/
Abstract

BACKGROUND

Chat Generative Pre-Trained Transformer (ChatGPT), launched by OpenAI in November 2022, features advanced large language models optimized for dialog. However, the performance differences between GPT-3.5, GPT-4, and GPT-4o in medical contexts remain unclear.

OBJECTIVE

This study evaluates the accuracy of GPT-3.5, GPT-4, and GPT-4o across various medical subjects. GPT-4o's performance in Chinese and English was also analyzed.

METHODS

We retrospectively compared GPT-3.5, GPT-4, and GPT-4o in Stage 1 of the Taiwanese Senior Professional and Technical Examinations for Medical Doctors (SPTEMD) from July 2021 to February 2024, excluding image-based questions.

RESULTS

The overall accuracy rates of GPT-3.5, GPT-4, and GPT-4o were 65.74% (781/1188), 95.71% (1137/1188), and 96.72% (1149/1188), respectively. GPT-4 and GPT-4o outperformed GPT-3.5 across all subjects. Statistical analysis revealed a significant difference between GPT-3.5 and the other models (p < 0.05) but no significant difference between GPT-4 and GPT-4o. Among subjects, physiology had a significantly higher error rate (p < 0.05) than the overall average across all three models. GPT-4o's accuracy rates in Chinese (98.14%) and English (98.48%) did not differ significantly.
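The abstract reports significance comparisons between models but does not name the exact statistical test used. As a rough sketch (not the authors' code), a Pearson chi-square test on a 2x2 correct/incorrect contingency table, built from the counts quoted above, reproduces the reported pattern: GPT-3.5 differs significantly from GPT-4 at alpha = 0.05, while GPT-4 and GPT-4o do not differ.

```python
# Sketch only: the paper's exact test is not stated in the abstract.
# Counts are taken from the RESULTS text (correct answers out of 1188).

def chi2_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square statistic for a 2x2 correct/incorrect table."""
    a, b = correct_a, total_a - correct_a  # model A: correct, incorrect
    c, d = correct_b, total_b - correct_b  # model B: correct, incorrect
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CRITICAL_05 = 3.841  # chi-square critical value at df=1, alpha=0.05

# GPT-3.5 (781/1188) vs GPT-4 (1137/1188): statistic far above 3.841
print(chi2_2x2(781, 1188, 1137, 1188) > CRITICAL_05)   # significant
# GPT-4 (1137/1188) vs GPT-4o (1149/1188): statistic below 3.841
print(chi2_2x2(1137, 1188, 1149, 1188) > CRITICAL_05)  # not significant
```

Note the 2x2 shortcut formula avoids any dependency on a stats library; with SciPy available, `scipy.stats.chi2_contingency` on the same table would also return the p-values directly.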

CONCLUSIONS

GPT-4 and GPT-4o exceed the accuracy threshold for Taiwanese SPTEMD, demonstrating advancements in contextual comprehension and reasoning. Future research should focus on responsible integration into medical training and assessment.


Figures (PMC full text):
Fig 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4664/12136359/dc9370643300/pone.0324841.g001.jpg
Fig 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4664/12136359/4f22099c1a2e/pone.0324841.g002.jpg
Fig 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4664/12136359/f4143e2a6215/pone.0324841.g003.jpg
Fig 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4664/12136359/806f70a932e2/pone.0324841.g004.jpg
Fig 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4664/12136359/ffe2e491d235/pone.0324841.g005.jpg

Similar Articles

1. Advancing medical AI: GPT-4 and GPT-4o surpass GPT-3.5 in Taiwanese medical licensing exams.
PLoS One. 2025 Jun 4;20(6):e0324841. doi: 10.1371/journal.pone.0324841. eCollection 2025.
2. Evaluating the performance of GPT-3.5, GPT-4, and GPT-4o in the Chinese National Medical Licensing Examination.
Sci Rep. 2025 Apr 23;15(1):14119. doi: 10.1038/s41598-025-98949-2.
3. ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis.
JMIR Med Educ. 2024 Nov 6;10:e63430. doi: 10.2196/63430.
4. An Evaluation of the Performance of OpenAI-o1 and GPT-4o in the Japanese National Examination for Physical Therapists.
Cureus. 2025 Jan 6;17(1):e76989. doi: 10.7759/cureus.76989. eCollection 2025 Jan.
5. Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination.
Int J Med Inform. 2025 Jan;193:105673. doi: 10.1016/j.ijmedinf.2024.105673. Epub 2024 Oct 28.
6. Performance of ChatGPT-3.5 and ChatGPT-4 in the Taiwan National Pharmacist Licensing Examination: Comparative Evaluation Study.
JMIR Med Educ. 2025 Jan 17;11:e56850. doi: 10.2196/56850.
7. Suitability of GPT-4o as an evaluator of cardiopulmonary resuscitation skills examinations.
Resuscitation. 2024 Nov;204:110404. doi: 10.1016/j.resuscitation.2024.110404. Epub 2024 Sep 28.
8. Generative pre-trained transformer 4o (GPT-4o) in solving text-based multiple response questions for European Diploma in Radiology (EDiR): a comparative study with radiologists.
Insights Imaging. 2025 Mar 22;16(1):66. doi: 10.1186/s13244-025-01941-7.
9. Performance of ChatGPT on Stage 1 of the Taiwanese medical licensing exam.
Digit Health. 2024 Feb 16;10:20552076241233144. doi: 10.1177/20552076241233144. eCollection 2024 Jan-Dec.
10. Performance of DeepSeek-R1 and ChatGPT-4o on the Chinese National Medical Licensing Examination: A Comparative Study.
J Med Syst. 2025 Jun 3;49(1):74. doi: 10.1007/s10916-025-02213-z.
