

New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology.

Affiliations

MD/PhD Scholars Program, University of Nebraska Medical Center, Omaha, Nebraska.

College of Medicine, University of Nebraska Medical Center, Omaha, Nebraska.

Publication Information

Urol Pract. 2023 Jul;10(4):409-415. doi: 10.1097/UPJ.0000000000000406. Epub 2023 Jun 5.

DOI: 10.1097/UPJ.0000000000000406
PMID: 37276372
Abstract

INTRODUCTION

Large language models have demonstrated impressive capabilities, but application to medicine remains unclear. We seek to evaluate the use of ChatGPT on the American Urological Association Self-assessment Study Program as an educational adjunct for urology trainees and practicing physicians.

METHODS

One hundred fifty questions from the 2022 Self-assessment Study Program exam were screened, and those containing visual assets (n=15) were removed. The remaining items were encoded as open-ended or multiple-choice questions. ChatGPT's output was coded as correct, incorrect, or indeterminate; if indeterminate, responses were regenerated up to 2 times. Concordance, quality, and accuracy were ascertained by 3 independent researchers and reviewed by 2 physician adjudicators. A new session was started for each entry to avoid crossover learning.
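The grading protocol above can be sketched as a small loop: ask in a fresh session, classify the output, and regenerate only while the label is indeterminate, up to 2 extra attempts. This is a minimal illustration, not the authors' code; `ask_model` and `classify` are hypothetical stand-ins for the manual querying and adjudication steps.

```python
# Hypothetical sketch of the protocol described in Methods: each question is
# posed in a new session; an indeterminate response is regenerated up to
# 2 more times. ask_model/classify are illustrative placeholders.
from typing import Callable, Tuple

MAX_REGENERATIONS = 2  # "responses were regenerated up to 2 times"


def grade_question(ask_model: Callable[[str], str],
                   classify: Callable[[str], str],
                   question: str) -> Tuple[str, int]:
    """Return (final label, attempt number).

    Labels: "correct", "incorrect", or "indeterminate".
    """
    attempts = 1 + MAX_REGENERATIONS  # initial output plus regenerations
    for attempt in range(attempts):
        label = classify(ask_model(question))  # fresh session per call
        if label != "indeterminate":
            return label, attempt + 1
    return "indeterminate", attempts
```

Note that under this scheme regeneration can only resolve indeterminate outputs into correct or incorrect ones, which matches the paper's observation that regeneration reduced indeterminate responses without raising the proportion of correct ones.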

RESULTS

ChatGPT was correct on 36/135 (26.7%) open-ended and 38/135 (28.2%) multiple-choice questions. Indeterminate responses were generated for 40 (29.6%) and 4 (3.0%) questions, respectively. Of the correct responses, 24/36 (66.7%) and 36/38 (94.7%) occurred on initial output, 8 (22.2%) and 1 (2.6%) on second output, and 4 (11.1%) and 1 (2.6%) on final output, respectively. Although regeneration decreased indeterminate responses, the proportion of correct responses did not increase. For both open-ended and multiple-choice questions, ChatGPT provided consistent justifications for incorrect answers and remained concordant between correct and incorrect answers.

CONCLUSIONS

ChatGPT previously demonstrated promise on medical licensing exams; however, comparable performance on the 2022 Self-assessment Study Program was not demonstrated. Performance was better on multiple-choice than on open-ended questions. More concerning were the persistent justifications for incorrect responses; left unchecked, utilization of ChatGPT in medicine may facilitate medical misinformation.


Similar Articles

1. New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology.
Urol Pract. 2023 Jul;10(4):409-415. doi: 10.1097/UPJ.0000000000000406. Epub 2023 Jun 5.

2. Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam.
Int J Nurs Stud. 2024 May;153:104717. doi: 10.1016/j.ijnurstu.2024.104717. Epub 2024 Feb 8.

3. Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study.
JMIR Med Educ. 2024 Feb 9;10:e48514. doi: 10.2196/48514.

4. ChatGPT Performance on the American Urological Association Self-assessment Study Program and the Potential Influence of Artificial Intelligence in Urologic Training.
Urology. 2023 Jul;177:29-33. doi: 10.1016/j.urology.2023.05.010. Epub 2023 May 18.

5. Performance of ChatGPT on the Taiwan urology board examination: insights into current strengths and shortcomings.
World J Urol. 2024 Apr 23;42(1):250. doi: 10.1007/s00345-024-04957-8.

6. Performance of ChatGPT-3.5 and ChatGPT-4 on the European Board of Urology (EBU) exams: a comparative analysis.
World J Urol. 2024 Jul 26;42(1):445. doi: 10.1007/s00345-024-05137-4.

7. Use of ChatGPT in Urology and its Relevance in Clinical Practice: Is it useful?
Int Braz J Urol. 2024 Mar-Apr;50(2):192-198. doi: 10.1590/S1677-5538.IBJU.2023.0570.

8. ChatGPT's performance in German OB/GYN exams - paving the way for AI-enhanced medical education and clinical practice.
Front Med (Lausanne). 2023 Dec 13;10:1296615. doi: 10.3389/fmed.2023.1296615. eCollection 2023.

9. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment.
JMIR Med Educ. 2023 Feb 8;9:e45312. doi: 10.2196/45312.

10. ChatGPT's Performance on the Hand Surgery Self-Assessment Exam: A Critical Analysis.
J Hand Surg Glob Online. 2024 Jan 2;6(2):200-205. doi: 10.1016/j.jhsg.2023.11.014. eCollection 2024 Mar.

Cited By

1. Perceptions and Earliest Experiences of Medical Students and Faculty With ChatGPT in Medical Education: Qualitative Study.
JMIR Med Educ. 2025 Feb 20;11:e63400. doi: 10.2196/63400.

2. Performance of artificial intelligence on Turkish dental specialization exam: can ChatGPT-4.0 and gemini advanced achieve comparable results to humans?
BMC Med Educ. 2025 Feb 10;25(1):214. doi: 10.1186/s12909-024-06389-9.

3. Exploring ChatGPT in clinical inquiry: a scoping review of characteristics, applications, challenges, and evaluation.
Ann Med Surg (Lond). 2024 Nov 8;86(12):7094-7104. doi: 10.1097/MS9.0000000000002716. eCollection 2024 Dec.

4. Analyzing evaluation methods for large language models in the medical field: a scoping review.
BMC Med Inform Decis Mak. 2024 Nov 29;24(1):366. doi: 10.1186/s12911-024-02709-7.

5. The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis.
J Med Internet Res. 2024 Nov 5;26:e56532. doi: 10.2196/56532.

6. ChatGPT-3.5 Versus Google Bard: Which Large Language Model Responds Best to Commonly Asked Pregnancy Questions?
Cureus. 2024 Jul 27;16(7):e65543. doi: 10.7759/cureus.65543. eCollection 2024 Jul.

7. Performance of ChatGPT-3.5 and ChatGPT-4 on the European Board of Urology (EBU) exams: a comparative analysis.
World J Urol. 2024 Jul 26;42(1):445. doi: 10.1007/s00345-024-05137-4.

8. Amplifying Chinese physicians' emphasis on patients' psychological states beyond urologic diagnoses with ChatGPT - a multicenter cross-sectional study.
Int J Surg. 2024 Oct 1;110(10):6501-6508. doi: 10.1097/JS9.0000000000001775.

9. Is ChatGPT ready for primetime? Performance of artificial intelligence on a simulated Canadian urology board exam.
Can Urol Assoc J. 2024 Oct;18(10):329-332. doi: 10.5489/cuaj.8800.

10. Diagnosis in Bytes: Comparing the Diagnostic Accuracy of Google and ChatGPT 3.5 as an Educational Support Tool.
Int J Environ Res Public Health. 2024 May 1;21(5):580. doi: 10.3390/ijerph21050580.