Asker Omer Faruk, Recai Muhammed Selim, Genc Yunus Emre, Dogan Kader Ada, Sener Tarik Emre, Sahin Bahadir
School of Medicine, Marmara University, Istanbul, Turkey.
Department of Urology, School of Medicine, Marmara University, Istanbul, Turkey.
BJU Int. 2025 Jul 31. doi: 10.1111/bju.16873.
As the integration of large language models (LLMs) into healthcare gains increasing attention, raising questions about their applications and limitations, this study aimed to evaluate the accuracy, calibration error, readability, and understandability of widely used chatbots with objective measurements, using 35 questions derived from urology in-service examinations.
A total of 35 European Board of Urology questions were posed to five LLMs (ChatGPT-4o, DeepSeek-R1, Gemini, Grok-2, and Claude 3.5) using a standardised prompt that was systematically designed and applied across all models. Accuracy for each model was assessed by Cohen's kappa against the answer key. Readability was assessed with the Flesch Reading Ease, Gunning Fog, Coleman-Liau, Simple Measure of Gobbledygook, and Automated Readability Index scores, while understandability was determined from residents' ratings on a Likert scale.
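A minimal sketch of how the agreement and readability metrics described above could be reproduced; the authors do not report their tooling, so the use of scikit-learn and textstat, and the truncated example data, are assumptions for illustration only.

```python
# Sketch (not from the paper): per-model agreement with the answer key and
# readability indices, assuming scikit-learn and textstat as the tooling.
from sklearn.metrics import cohen_kappa_score
import textstat

# Hypothetical answer key and one model's responses (truncated example data).
answer_key = ["A", "C", "B", "D", "A"]
model_answers = ["A", "C", "B", "B", "A"]

# Agreement with the answer key (reported per model in the study) and raw accuracy.
kappa = cohen_kappa_score(answer_key, model_answers)
accuracy = sum(k == m for k, m in zip(answer_key, model_answers)) / len(answer_key)

# Readability indices applied to a model's free-text explanation.
explanation = "Example model explanation text for one question."
readability = {
    "flesch_reading_ease": textstat.flesch_reading_ease(explanation),
    "gunning_fog": textstat.gunning_fog(explanation),
    "coleman_liau": textstat.coleman_liau_index(explanation),
    "smog": textstat.smog_index(explanation),
    "ari": textstat.automated_readability_index(explanation),
}
print(kappa, accuracy, readability)
```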
The models and the answer key were in substantial agreement, with a Fleiss' kappa of 0.701 and a Cronbach's alpha of 0.914. For accuracy, Cohen's kappa was 0.767 for ChatGPT-4o, 0.764 for DeepSeek-R1, and 0.765 for Grok-2 (80% accuracy for each), followed by 0.729 for Claude 3.5 (77% accuracy) and 0.611 for Gemini (68.4% accuracy). ChatGPT-4o had the lowest calibration error (19.2%), while DeepSeek-R1 scored highest for readability. In the understandability analysis, Claude 3.5 received the highest rating of all the models.
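The abstract does not define how "calibration error" was computed; a common choice is the expected calibration error (ECE) over self-reported confidences. The sketch below assumes that definition with 10 equal-width confidence bins and uses simulated confidences and correctness, so the function name, binning, and data are illustrative assumptions, not the authors' method.

```python
# Sketch: expected calibration error (ECE), assuming self-reported confidences
# and 10 equal-width bins; the study's exact definition is not stated.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Confidence-weighted gap between stated confidence and observed accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Hypothetical confidences and correctness for one model's 35 answers.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 35)
corr = (rng.uniform(0.0, 1.0, 35) < conf).astype(float)
print(f"ECE = {expected_calibration_error(conf, corr):.3f}")
```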
The chatbots demonstrated varying strengths across different tasks. DeepSeek-R1, despite having only recently been released, showed promising results in medical applications. These findings highlight the need for further optimisation to better understand the applications of chatbots in urology.