Touma Naji J, Caterini Jessica, Liblik Kiera
Queen's University, Kingston, ON, Canada.
Can Urol Assoc J. 2024 Oct;18(10):329-332. doi: 10.5489/cuaj.8800.
Generative artificial intelligence (AI) has proven to be a powerful tool with increasing applications in clinical care and medical education. ChatGPT has performed adequately on many specialty certification and knowledge assessment exams. The objective of this study was to assess the performance of ChatGPT 4 on a multiple-choice exam meant to simulate the Canadian urology board exam.
Graduating urology residents representing all Canadian training programs gather yearly for a mock exam that simulates their upcoming board-certifying exam. The exam consists of written multiple-choice questions (MCQs) and an oral objective structured clinical examination (OSCE). The 2022 exam was taken by 29 graduating residents and was administered to ChatGPT 4.
ChatGPT 4 scored 46% on the MCQ exam, whereas the mean and median scores of graduating urology residents were 62.6% and 62.7%, respectively. This places ChatGPT's score 1.8 standard deviations below the median, a percentile rank in the sixth percentile. ChatGPT's scores on the exam topics were as follows: oncology 35%, andrology/benign prostatic hyperplasia 62%, physiology/anatomy 67%, incontinence/female urology 23%, infections 71%, urolithiasis 57%, and trauma/reconstruction 17%, with ChatGPT 4's oncology performance being significantly below that of postgraduate year 5 residents.
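The standard-deviation figure above can be sanity-checked from the reported numbers alone. A minimal sketch, assuming the cohort standard deviation is not given and must be backed out of the reported z-score of 1.8 (the ~9.3% value below is therefore an inference, not a figure from the study):

```python
from statistics import NormalDist

median_score = 62.7   # resident median on the MCQ exam (%)
gpt_score = 46.0      # ChatGPT 4 score (%)
reported_z = -1.8     # "1.8 standard deviations below the median"

# Back out the implied cohort standard deviation (assumption: z was
# computed against the median with the usual z = (x - m) / sd formula)
sd = (gpt_score - median_score) / reported_z   # ≈ 9.28 percentage points

# Recompute the z-score and a normal-approximation percentile
z = (gpt_score - median_score) / sd
percentile = NormalDist(0, 1).cdf(z) * 100     # ≈ 3.6%
```

Note that the normal approximation yields roughly the fourth percentile; the sixth percentile reported in the abstract is presumably ChatGPT's empirical rank within the 29-resident cohort rather than a parametric estimate.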
ChatGPT 4 underperforms on an MCQ exam meant to simulate the Canadian board exam. Ongoing assessment of the capabilities of generative AI is needed as these models evolve and are trained on additional urology content.