Hsieh Ching-Hua, Hsieh Hsiao-Yun, Lin Hui-Ping
Department of Plastic Surgery, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University and College of Medicine, Kaohsiung, 83301, Taiwan.
Heliyon. 2024 Jul 18;10(14):e34851. doi: 10.1016/j.heliyon.2024.e34851. eCollection 2024 Jul 30.
Chat Generative Pre-Trained Transformer (ChatGPT) is a state-of-the-art large language model that has been evaluated across various medical fields, with mixed performance on licensing examinations. This study aimed to assess the performance of ChatGPT-3.5 and ChatGPT-4 in answering questions from the Taiwan Plastic Surgery Board Examination.
The study evaluated ChatGPT-3.5 and ChatGPT-4 on 1375 questions from the past eight years of the Taiwan Plastic Surgery Board Examination, comprising 985 single-choice and 390 multiple-choice questions. We obtained the responses between June and July 2023, launching a new chat session for each question to eliminate memory-retention bias.
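A minimal sketch of the fresh-session protocol described above, assuming the OpenAI Python client; the model names, prompt handling, and loop are illustrative rather than the authors' actual script. The key point is that each question is sent in its own newly created conversation, so no earlier answer can leak into a later one.

```python
# Illustrative sketch (not the authors' script): each exam question is sent
# in its own, brand-new chat so prior answers cannot influence later ones.
from openai import OpenAI  # assumed client library; pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_fresh_session(question: str, model: str) -> str:
    """Send one exam question in a new chat with no prior context."""
    response = client.chat.completions.create(
        model=model,
        # A fresh message list per call = a fresh chat session.
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


# Hypothetical usage: query both models on every question independently.
questions = ["<exam question 1>", "<exam question 2>"]  # 1375 in the study
for model in ("gpt-3.5-turbo", "gpt-4"):
    answers = [ask_fresh_session(q, model) for q in questions]
```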
Overall, ChatGPT-4 outperformed ChatGPT-3.5, achieving a 59% correct-answer rate versus 41% for ChatGPT-3.5. ChatGPT-4 passed five of the eight yearly exams, whereas ChatGPT-3.5 failed all eight. On single-choice questions, ChatGPT-4 scored 66% correct versus 48% for ChatGPT-3.5; on multiple-choice questions, it achieved 43%, nearly double ChatGPT-3.5's 23%.
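As a consistency check, the overall rates follow as a weighted average of the per-format rates over the 985 single-choice and 390 multiple-choice questions; the short computation below uses only the figures reported above.

```python
# Weighted-average check: the overall correct rates should follow from the
# per-format rates reported above (985 single-choice, 390 multiple-choice).
n_single, n_multi = 985, 390
total = n_single + n_multi  # 1375 questions

for model, (single_rate, multi_rate) in {
    "ChatGPT-4": (0.66, 0.43),
    "ChatGPT-3.5": (0.48, 0.23),
}.items():
    overall = (n_single * single_rate + n_multi * multi_rate) / total
    print(f"{model}: {overall:.0%}")  # ~59% and ~41%, matching the abstract
```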
As ChatGPT evolves, its performance on the Taiwan Plastic Surgery Board Examination is expected to improve further. The study suggests potential reforms, such as incorporating more problem-based scenarios, leveraging ChatGPT to refine exam questions, and integrating AI-assisted learning into candidate preparation. These advancements could enhance the assessment of candidates' critical thinking and problem-solving abilities in the field of plastic surgery.