
While GPT-3.5 is unable to pass the Physician Licensing Exam in Taiwan, GPT-4 successfully meets the criteria.

Author Information

Chen Tsung-An, Lin Kuan-Chen, Lin Ming-Hwai, Chang Hsiao-Ting, Chen Yu-Chun, Chen Tzeng-Ji

Affiliations

Department of Family Medicine, Taipei Veterans General Hospital, Taipei, Taiwan, ROC.

School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan, ROC.

Publication Information

J Chin Med Assoc. 2025 May 1;88(5):352-360. doi: 10.1097/JCMA.0000000000001225. Epub 2025 Mar 14.

Abstract

BACKGROUND

This study investigates the performance of ChatGPT-3.5 and ChatGPT-4 in answering medical questions from Taiwan's Physician Licensing Exam, ranging from basic medical knowledge to specialized clinical topics. It aims to assess these artificial intelligence (AI) models' capabilities in a non-English context, specifically Traditional Chinese.

METHODS

The study included questions from the 2022 Taiwan Physician Licensing Exam, excluding image-based items. Each question was manually entered into ChatGPT, and the responses were compared with the official answers from Taiwan's Ministry of Examination. Differences across specialties and question types were assessed using the Kruskal-Wallis test and Fisher's exact test; a minimal illustration of these tests is sketched below.
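The following Python snippet is a minimal sketch of the two tests named above, not the authors' analysis code. All data in it are hypothetical: the per-subject correctness lists and the 2x2 contingency table are illustrative placeholders, and the subject and question-type labels are assumptions rather than the paper's actual categories.

```python
# Minimal sketch (not the authors' code) of the statistical tests named in
# METHODS, using hypothetical correctness data (1 = correct, 0 = incorrect).
from scipy.stats import kruskal, fisher_exact

# Hypothetical per-question results grouped by subject.
results_by_subject = {
    "Anatomy":      [1, 1, 0, 1, 1, 0, 1, 1],
    "Pharmacology": [1, 0, 1, 1, 0, 1, 1, 0],
    "Pathology":    [0, 1, 1, 0, 1, 1, 0, 1],
}

# Kruskal-Wallis H-test: do correctness distributions differ across subjects?
h_stat, p_subjects = kruskal(*results_by_subject.values())
print(f"Kruskal-Wallis across subjects: H = {h_stat:.2f}, p = {p_subjects:.3f}")

# Fisher's exact test on a 2x2 table of correct/incorrect counts for two
# hypothetical question types (counts are illustrative, not from the paper).
contingency = [[45, 5],    # question type A: correct, incorrect
               [38, 12]]   # question type B: correct, incorrect
odds_ratio, p_types = fisher_exact(contingency)
print(f"Fisher's exact across question types: OR = {odds_ratio:.2f}, p = {p_types:.3f}")
```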

RESULTS

ChatGPT-3.5 achieved an average accuracy of 67.7% in basic medical sciences and 53.2% in clinical medicine, whereas ChatGPT-4 significantly outperformed it, with average accuracies of 91.9% in basic medical sciences and 90.7% in clinical medicine. ChatGPT-3.5 scored above 60.0% in seven of 10 basic medical science subjects and three of 14 clinical subjects, while ChatGPT-4 scored above 60.0% in every subject. The type of question did not significantly affect accuracy rates.

CONCLUSION

ChatGPT-3.5 showed proficiency in basic medical sciences but was less reliable in clinical medicine, whereas ChatGPT-4 demonstrated strong capabilities in both areas. However, their proficiency varied across specialties, and the type of question had minimal impact on performance. This study highlights the potential of AI models in medical education and non-English-language examinations, as well as the need for cautious and informed implementation in educational settings given the variability across specialties.

