
Evaluating the Efficacy of AI Chatbots as Tutors in Urology: A Comparative Analysis of Responses to the 2022 In-Service Assessment of the European Board of Urology.

Affiliations

Department of Urology, St. Elisabeth Hospital Straubing, Brothers of Mercy Hospital, Straubing, Germany.

Department of Urology, Caritas St. Josef Medical Centre, University of Regensburg, Regensburg, Germany.

Publication information

Urol Int. 2024;108(4):359-366. doi: 10.1159/000537854. Epub 2024 Mar 30.

Abstract

INTRODUCTION

This study assessed the potential of large language models (LLMs) as educational tools by evaluating their accuracy in answering questions across urological subtopics.

METHODS

Three LLMs (ChatGPT-3.5, ChatGPT-4, and Bing AI) were examined in two testing rounds, separated by 48 h, using 100 multiple-choice questions (MCQs) from the 2022 European Board of Urology (EBU) In-Service Assessment (ISA), covering five urological subtopics. A response matching the designated single best answer (SBA) among the four options was scored as "formal accuracy" (FA). Alternative answers chosen by the LLMs that are not the SBA but are still deemed correct were labeled "extended accuracy" (EA). The capacity of EA to raise the overall accuracy rate when combined with FA was then examined.
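The FA/EA scoring scheme described above lends itself to a short sketch. Below is a minimal Python illustration under assumed data structures; the study does not publish code, and the names `responses`, `sba_key`, and `ea_key` are invented for illustration:

```python
# Minimal sketch of the FA/EA scoring scheme described above.
# All data structures here are hypothetical; the study did not publish code.

def score_run(responses, sba_key, ea_key):
    """Score one testing round of MCQs.

    responses: dict mapping question ID -> option chosen by the LLM
    sba_key:   dict mapping question ID -> designated single best answer (FA)
    ea_key:    dict mapping question ID -> set of alternative answers still
               deemed correct (EA), possibly empty
    """
    n = len(sba_key)
    fa_hits = sum(responses[q] == sba_key[q] for q in sba_key)
    ea_hits = sum(
        responses[q] != sba_key[q] and responses[q] in ea_key.get(q, set())
        for q in sba_key
    )
    return {
        "FA": fa_hits / n,                 # formal accuracy
        "FA+EA": (fa_hits + ea_hits) / n,  # combined accuracy
    }

# Example with a 3-question toy key (IDs and options are invented):
sba = {"Q1": "A", "Q2": "C", "Q3": "B"}
ea = {"Q2": {"D"}}  # Q2 has an alternative answer still deemed correct
llm = {"Q1": "A", "Q2": "D", "Q3": "C"}
print(score_run(llm, sba, ea))  # {'FA': 0.33..., 'FA+EA': 0.66...}
```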

RESULTS

Across the two testing rounds, the FA scores were: ChatGPT-3.5, 58% and 62%; ChatGPT-4, 63% and 77%; and Bing AI, 81% and 73%. Incorporating EA did not significantly enhance overall performance: the resulting gains for ChatGPT-3.5, ChatGPT-4, and Bing AI were 7% and 5%, 5% and 2%, and 3% and 1%, respectively (p > 0.3). Among the urological subtopics, the LLMs performed best in Pediatrics/Congenital and comparatively worst in Functional/BPS/Incontinence.
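The abstract does not name the statistical test behind p > 0.3. As an illustration only, one rough way to compare an FA rate with the corresponding combined FA+EA rate on n = 100 questions is a chi-square test on the 2x2 correct/incorrect table; note that this treats the two rates as independent samples and ignores that both come from the same items, so it is a sketch rather than a reproduction of the authors' analysis:

```python
# Illustration only: the abstract does not state which test produced
# p > 0.3. A rough check is a chi-square test on the 2x2 table of
# correct/incorrect counts for FA vs. combined FA+EA.
from scipy.stats import chi2_contingency

def compare_rates(fa_correct: int, combined_correct: int, n: int = 100) -> float:
    """Return the p-value comparing FA vs. FA+EA correct counts out of n."""
    table = [
        [fa_correct, n - fa_correct],
        [combined_correct, n - combined_correct],
    ]
    chi2, p, dof, expected = chi2_contingency(table)
    return p

# Round-1 ChatGPT-3.5 figures from the abstract: 58% FA, +7% gain -> 65%.
print(compare_rates(58, 65))  # well above 0.05, consistent with p > 0.3
```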

CONCLUSION

LLMs exhibit suboptimal urological knowledge and unsatisfactory proficiency for educational purposes. Overall accuracy did not improve significantly when EA was combined with FA, and error rates remained high, ranging from 16% to 35%. Proficiency varies substantially across subtopics. Further development of medicine-specific LLMs is required before integration into urological training programs.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/191c/11305516/d653c10930e7/uin-2024-0108-0004-537854_F01.jpg
