Allen Ao Guo, Ashan Canagasingham, Krishan Rasiah, Venu Chalasani, Julie Mundy, Amanda Chung
Department of Urology, Royal North Shore Hospital, Sydney, Australia.
North Shore Urology Research Group, Sydney, Australia.
ANZ J Surg. 2025 Jul-Aug;95(7-8):1350-1355. doi: 10.1111/ans.70186. Epub 2025 May 30.
Large language models have developed rapidly in recent years, and models such as ChatGPT may come to play an important role in enhancing medical education.
To evaluate the accuracy and performance of ChatGPT in the Generic Surgical Sciences Examination, we constructed a sample examination with questions sourced from a bank of past questions and formatted to mirror the structure and layout of the examination. ChatGPT's responses were scored against a predefined answer key.
ChatGPT scored 468 of a possible 644 marks, an overall result of 72.7% across all sections tested. It performed best in the physiology section (77.9%), followed by pathology (75.0%), and lowest in the anatomy section (66.3%). Analyzed by question type, ChatGPT performed best on type "A" questions (multiple choice), scoring 75%, followed closely by type "X" questions (true or false) at 73.2%; however, it scored only 43.8% on type "B" questions (establishing a relationship between two statements).
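As a quick arithmetic check of the overall figure above (a minimal sketch in Python; the abstract reports raw marks only for the overall total, so the section and question-type results, given as percentages, cannot be re-derived here):

```python
# Verify the overall percentage from the raw marks reported in the abstract.
scored, maximum = 468, 644
overall_pct = 100 * scored / maximum
print(f"{overall_pct:.1f}%")  # -> 72.7%, matching the reported overall result
```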
Our results demonstrate that ChatGPT completed the Generic Surgical Sciences Examination with accuracy exceeding the pass threshold for the examination. However, the model struggled with certain question types and subject areas. Overall, further research into the utility of ChatGPT in surgical education is required, and caution should be exercised in its use, as the technology remains in its infancy.