Touma Naji J, Caterini Jessica, Liblik Kiera
Queen's University, Kingston, ON, Canada.
Can Urol Assoc J. 2024 Oct;18(10):329-332. doi: 10.5489/cuaj.8800.
Generative artificial intelligence (AI) has proven to be a powerful tool with increasing applications in clinical care and medical education. ChatGPT has performed adequately on many specialty certification and knowledge assessment exams. The objective of this study was to assess the performance of ChatGPT 4 on a multiple-choice exam meant to simulate the Canadian urology board exam.
Graduating urology residents representing all Canadian training programs gather yearly for a mock exam that simulates their upcoming board-certifying exam. The exam consists of written multiple-choice questions (MCQs) and an oral objective structured clinical examination (OSCE). The 2022 exam was taken by 29 graduating residents and was administered to ChatGPT 4.
ChatGPT 4 scored 46% on the MCQ exam, whereas the mean and median scores of graduating urology residents were 62.6% and 62.7%, respectively. This places ChatGPT's score 1.8 standard deviations below the median, a percentile rank in the sixth percentile. ChatGPT's scores on the exam topics were as follows: oncology 35%, andrology/benign prostatic hyperplasia 62%, physiology/anatomy 67%, incontinence/female urology 23%, infections 71%, urolithiasis 57%, and trauma/reconstruction 17%, with ChatGPT 4's oncology performance being significantly below that of postgraduate year 5 residents.
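The standard-deviation figure above can be sanity-checked from the reported numbers alone. A minimal sketch, assuming the cohort standard deviation is not given and must be backed out of the reported z-score of 1.8 (the ~9.3% value below is therefore an inference, not a figure from the study):

```python
from statistics import NormalDist

median_score = 62.7   # resident median on the MCQ exam (%)
gpt_score = 46.0      # ChatGPT 4 score (%)
reported_z = -1.8     # "1.8 standard deviations below the median"

# Back out the implied cohort standard deviation (assumption: z was
# computed against the median with the usual z = (x - m) / sd formula)
sd = (gpt_score - median_score) / reported_z   # ≈ 9.28 percentage points

# Recompute the z-score and a normal-approximation percentile
z = (gpt_score - median_score) / sd
percentile = NormalDist(0, 1).cdf(z) * 100     # ≈ 3.6%
```

Note that the normal approximation yields roughly the fourth percentile; the sixth percentile reported in the abstract is presumably ChatGPT's empirical rank within the 29-resident cohort rather than a parametric estimate.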
ChatGPT 4 underperforms on an MCQ exam meant to simulate the Canadian board exam. Ongoing assessment of the capabilities of generative AI is needed as these models evolve and are trained on additional urology content.