Lockhart Kathleen, Canagasingham Ashan, Zhong Wenjie, Ashrafi Darius, March Brayden, Cole-Clark Dane, Grant Alice, Chung Amanda
Royal North Shore Hospital, St Leonards, New South Wales, Australia.
Wollongong Hospital, Wollongong, New South Wales, Australia.
BJU Int. 2025 Sep;136(3):523-528. doi: 10.1111/bju.16806. Epub 2025 Jun 19.
To assess the performance of ChatGPT compared to human trainees in the Australian Urology written fellowship examination (essay format).
Each examination was marked independently by two blinded examining urologists and assessed for: overall pass/fail outcome; proportion of passing questions; and adjusted aggregate score. The examining urologists also made a blinded judgement as to authorship (artificial intelligence [AI] or trainee).
A total of 20 examination papers were marked: 10 completed by urology trainees and 10 by AI platforms (half each on ChatGPT-3.5 and ChatGPT-4.0). Overall, 9/10 trainees passed the urology fellowship examination, whereas only 6/10 ChatGPT examinations passed (P = 0.3). Of the four failing ChatGPT examinations, three were completed on the ChatGPT-3.5 platform. The mean proportion of passing questions per examination was higher for trainees than for ChatGPT: 89.4% vs 80.9% (P = 0.2). Trainees' adjusted aggregate scores were also marginally higher than those of ChatGPT: mean 79.2% vs 78.1% (P = 0.8). ChatGPT-3.5 and ChatGPT-4.0 achieved similar aggregate scores (78.9% and 77.4%, P = 0.8), although ChatGPT-3.5 had a lower mean percentage of passing questions per examination: 79.6% vs 82.1% (P = 0.8). Two examinations were misattributed by the examining urologists (both were trainee candidates perceived to be ChatGPT); hence the sensitivity for identifying ChatGPT authorship was 100% and the overall accuracy was 91.7%.
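As a worked illustration of the detection figures, the minimal Python sketch below applies the standard confusion-matrix definitions of sensitivity and accuracy. The judgement counts are hypothetical (the abstract reports the two misattributed papers but not the denominator behind the 91.7% figure); they are chosen only so the output matches the reported 100% sensitivity and 91.7% accuracy, and should not be read as the study's actual tallies.

```python
# Minimal sketch: confusion-matrix definitions behind the reported
# sensitivity (100%) and overall accuracy (91.7%) for detecting AI authorship.
# Counts below are HYPOTHETICAL illustrations, not figures from the paper.

def sensitivity(tp: int, fn: int) -> float:
    """True-positive rate: AI-authored papers correctly flagged as AI."""
    return tp / (tp + fn)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Proportion of all authorship judgements that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical judgement counts (AI authorship = "positive" class):
tp, fn = 10, 0   # every AI-written paper identified as AI
fp, tn = 2, 12   # two trainee papers mistaken for AI; the rest judged correctly

print(f"Sensitivity: {sensitivity(tp, fn):.1%}")       # 100.0%
print(f"Accuracy:    {accuracy(tp, tn, fp, fn):.1%}")  # 91.7%
```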
Overall, ChatGPT did not perform as well as human trainees in the Australian Urology fellowship written examination. Examiners were able to identify AI-generated answers with a high degree of accuracy.