
Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study.

Affiliations

Department of Urology, Penn State Health Milton S. Hershey Medical Center, Hershey, PA, USA.

Penn State College of Medicine, Hershey, PA, USA.

Publication Information

J Educ Eval Health Prof. 2024;21:17. doi: 10.3352/jeehp.2024.21.17. Epub 2024 Jul 8.

Abstract

PURPOSE

This study aimed to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) with respect to standardized urology multiple-choice items in the United States.

METHODS

In total, 700 multiple-choice urology board exam-style items were submitted to GPT-3.5 and GPT-4, and responses were recorded. Items were categorized based on topic and question complexity (recall, interpretation, and problem-solving). The accuracy of GPT-3.5 and GPT-4 was compared across item types in February 2024.
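
The abstract does not describe the submission pipeline in detail, so the following is only a minimal sketch of how board-style items might be sent to both models and the responses logged for later scoring. It assumes the OpenAI Python SDK (v1.x), an OPENAI_API_KEY in the environment, and an illustrative item format; the example question, the prompt wording, and the letter-matching check are all hypothetical, not the authors' actual method.

```python
# Sketch: submit one multiple-choice item to GPT-3.5 and GPT-4 and record the answers.
# Assumes the OpenAI Python SDK (v1.x); item content and prompt wording are illustrative.
import csv
from openai import OpenAI

client = OpenAI()

MODELS = {"GPT-3.5": "gpt-3.5-turbo", "GPT-4": "gpt-4"}

def ask(model: str, stem: str, options: dict[str, str]) -> str:
    """Send one board-style item and return the model's raw answer text."""
    prompt = stem + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic answers make scoring reproducible
        messages=[
            {"role": "system",
             "content": "Answer the multiple-choice question with a single letter."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content.strip()

# Hypothetical example item; a real study would load items from a question bank.
item = {
    "topic": "endourology",
    "complexity": "recall",
    "stem": "Which imaging modality is most sensitive for detecting urolithiasis?",
    "options": {"A": "Ultrasound", "B": "Non-contrast CT",
                "C": "KUB radiograph", "D": "MRI"},
    "key": "B",
}

with open("responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "topic", "complexity", "answer", "correct"])
    for label, model_id in MODELS.items():
        answer = ask(model_id, item["stem"], item["options"])
        # Naive scoring: assumes the reply begins with the answer letter.
        writer.writerow([label, item["topic"], item["complexity"],
                         answer, answer.startswith(item["key"])])
```

Running this over all 700 items and grouping the CSV by topic and complexity would yield the per-category accuracies compared in the results.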

RESULTS

GPT-4 answered 44.4% of items correctly, compared with 30.9% for GPT-3.5 (P<0.0001). GPT-4 (vs. GPT-3.5) had higher accuracy on urologic oncology (43.8% vs. 33.9%, P=0.03), sexual medicine (44.3% vs. 27.8%, P=0.046), and pediatric urology (47.1% vs. 27.1%, P=0.012) items. Endourology (38.0% vs. 25.7%, P=0.15), reconstruction and trauma (29.0% vs. 21.0%, P=0.41), and neurourology (49.0% vs. 33.3%, P=0.11) items showed no significant difference between versions. GPT-4 also outperformed GPT-3.5 on recall (45.9% vs. 27.4%, P<0.00001) and interpretation (45.6% vs. 31.5%, P=0.0005) items, whereas the difference on higher-complexity problem-solving items (41.8% vs. 34.5%, P=0.56) was not significant.
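
The abstract does not state which statistical test produced these P-values, so the sketch below only illustrates one plausible approach: comparing the two models' correct/incorrect counts in a subgroup with a chi-square test of independence. The counts are made up for illustration and do not reproduce the paper's values.

```python
# Sketch: compare two models' accuracies on the same item subgroup.
# The choice of chi-square test and the counts below are assumptions for illustration.
from scipy.stats import chi2_contingency

def compare_accuracy(correct_a: int, total_a: int, correct_b: int, total_b: int):
    """Return (accuracy A, accuracy B, P-value) for a 2x2 correct/incorrect table."""
    table = [
        [correct_a, total_a - correct_a],  # model A: correct vs. incorrect
        [correct_b, total_b - correct_b],  # model B: correct vs. incorrect
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    return correct_a / total_a, correct_b / total_b, p

# Hypothetical subgroup of 100 items answered by each model.
acc_gpt4, acc_gpt35, p = compare_accuracy(44, 100, 31, 100)
print(f"GPT-4 {acc_gpt4:.1%} vs. GPT-3.5 {acc_gpt35:.1%}, P={p:.3f}")
```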

CONCLUSIONS

ChatGPT performs relatively poorly on standardized multiple-choice urology board exam-style items, with GPT-4 outperforming GPT-3.5. Accuracy for both versions fell below the proposed minimum passing standard (60%) for the American Board of Urology's Continuing Urologic Certification knowledge reinforcement activity. As artificial intelligence progresses in complexity, ChatGPT may become more capable and accurate on board examination items. For now, its responses should be scrutinized.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e9bc/11893186/cacb4c7d9d1a/jeehp-21-17f1.jpg
