
Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study.

Affiliations

Department of Urology, Penn State Health Milton S. Hershey Medical Center, Hershey, PA, USA.

Penn State College of Medicine, Hershey, PA, USA.

Publication Information

J Educ Eval Health Prof. 2024;21:17. doi: 10.3352/jeehp.2024.21.17. Epub 2024 Jul 8.

Abstract

PURPOSE

This study aimed to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) with respect to standardized urology multiple-choice items in the United States.

METHODS

In total, 700 multiple-choice urology board exam-style items were submitted to GPT-3.5 and GPT-4, and responses were recorded. Items were categorized based on topic and question complexity (recall, interpretation, and problem-solving). The accuracy of GPT-3.5 and GPT-4 was compared across item types in February 2024.
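
The abstract does not describe the submission pipeline in detail, so the following is only a minimal sketch of how board-style items might be sent to both models and the responses logged for later scoring. It assumes the OpenAI Python SDK (v1.x), an OPENAI_API_KEY in the environment, and an illustrative item format; the example question, the prompt wording, and the letter-matching check are all hypothetical, not the authors' actual method.

```python
# Sketch: submit one multiple-choice item to GPT-3.5 and GPT-4 and record the answers.
# Assumes the OpenAI Python SDK (v1.x); item content and prompt wording are illustrative.
import csv
from openai import OpenAI

client = OpenAI()

MODELS = {"GPT-3.5": "gpt-3.5-turbo", "GPT-4": "gpt-4"}

def ask(model: str, stem: str, options: dict[str, str]) -> str:
    """Send one board-style item and return the model's raw answer text."""
    prompt = stem + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic answers make scoring reproducible
        messages=[
            {"role": "system",
             "content": "Answer the multiple-choice question with a single letter."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content.strip()

# Hypothetical example item; a real study would load items from a question bank.
item = {
    "topic": "endourology",
    "complexity": "recall",
    "stem": "Which imaging modality is most sensitive for detecting urolithiasis?",
    "options": {"A": "Ultrasound", "B": "Non-contrast CT",
                "C": "KUB radiograph", "D": "MRI"},
    "key": "B",
}

with open("responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "topic", "complexity", "answer", "correct"])
    for label, model_id in MODELS.items():
        answer = ask(model_id, item["stem"], item["options"])
        # Naive scoring: assumes the reply begins with the answer letter.
        writer.writerow([label, item["topic"], item["complexity"],
                         answer, answer.startswith(item["key"])])
```

Running this over all 700 items and grouping the CSV by topic and complexity would yield the per-category accuracies compared in the results.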

RESULTS

GPT-4 answered 44.4% of items correctly, compared with 30.9% for GPT-3.5 (P<0.0001). GPT-4 (vs. GPT-3.5) had higher accuracy on urologic oncology (43.8% vs. 33.9%, P=0.03), sexual medicine (44.3% vs. 27.8%, P=0.046), and pediatric urology (47.1% vs. 27.1%, P=0.012) items. Endourology (38.0% vs. 25.7%, P=0.15), reconstruction and trauma (29.0% vs. 21.0%, P=0.41), and neurourology (49.0% vs. 33.3%, P=0.11) items showed no significant difference between versions. GPT-4 also outperformed GPT-3.5 on recall (45.9% vs. 27.4%, P<0.00001) and interpretation (45.6% vs. 31.5%, P=0.0005) items, whereas the difference on higher-complexity problem-solving items (41.8% vs. 34.5%, P=0.56) was not significant.
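
The abstract does not state which statistical test produced these P-values, so the sketch below only illustrates one plausible approach: comparing the two models' correct/incorrect counts in a subgroup with a chi-square test of independence. The counts are made up for illustration and do not reproduce the paper's values.

```python
# Sketch: compare two models' accuracies on the same item subgroup.
# The choice of chi-square test and the counts below are assumptions for illustration.
from scipy.stats import chi2_contingency

def compare_accuracy(correct_a: int, total_a: int, correct_b: int, total_b: int):
    """Return (accuracy A, accuracy B, P-value) for a 2x2 correct/incorrect table."""
    table = [
        [correct_a, total_a - correct_a],  # model A: correct vs. incorrect
        [correct_b, total_b - correct_b],  # model B: correct vs. incorrect
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    return correct_a / total_a, correct_b / total_b, p

# Hypothetical subgroup of 100 items answered by each model.
acc_gpt4, acc_gpt35, p = compare_accuracy(44, 100, 31, 100)
print(f"GPT-4 {acc_gpt4:.1%} vs. GPT-3.5 {acc_gpt35:.1%}, P={p:.3f}")
```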

CONCLUSIONS

ChatGPT performs relatively poorly on standardized multiple-choice urology board exam-style items, with GPT-4 outperforming GPT-3.5. Accuracy for both versions fell below the proposed minimum passing standard (60%) for the American Board of Urology's Continuing Urologic Certification knowledge reinforcement activity. As artificial intelligence progresses in complexity, ChatGPT may become more capable and accurate on board examination items. For now, its responses should be scrutinized.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e9bc/11893186/cacb4c7d9d1a/jeehp-21-17f1.jpg
