Hacibey Ibrahim, Halis Ahmet
Department of Urology, Basaksehir Çam and Sakura City Hospital, Istanbul, Türkiye.
Department of Urology, Yedikule Chest Diseases and Chest Surgery Training and Research Hospital, Istanbul, Türkiye.
Investig Clin Urol. 2025 May;66(3):188-193. doi: 10.4111/icu.20250040.
This study aimed to evaluate the performance of three artificial intelligence (AI) models (ChatGPT, Gemini, and Copilot) in addressing clinically relevant questions about onabotulinum toxin and sacral neuromodulation (SNM) for the management of overactive bladder (OAB).
A set of 30 questions covering mechanisms of action, indications, contraindications, procedural details, efficacy, and safety profiles was posed to each AI model. Responses were assessed by a panel of four urology specialists using predefined criteria: accuracy, completeness, clarity, and consistency. A multi-dimensional scoring framework evaluated the performance across five dimensions: factual accuracy, relevance, clarity/coherence, structure, and utility. Responses were scored on a 4-point Likert scale, and statistical analyses were conducted using one-way ANOVA to compare model performance.
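As an illustration of the statistical comparison described above, the following is a minimal sketch of a one-way ANOVA F statistic computed over Likert scores grouped by model. The function name `one_way_anova_f`, the group labels, and all score values are hypothetical placeholders chosen for illustration; they are not the study's data, and the abstract does not state which software the authors used.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of scores per model.
    """
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k = len(groups)        # number of groups (models)
    n = len(all_scores)    # total number of scored responses

    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical placeholder scores on a 4-point scale (NOT the study's data).
scores = {
    "model_a": [4, 4, 3, 4, 4],
    "model_b": [3, 3, 4, 3, 3],
    "model_c": [2, 3, 3, 2, 3],
}

f_stat, df_between, df_within = one_way_anova_f(list(scores.values()))
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

The F statistic would then be compared against the F distribution with the reported degrees of freedom to obtain a p-value; a statistics package is normally used for that step.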
ChatGPT achieved the highest mean score (3.98/4) across all dimensions, with statistically significant differences compared to Gemini (3.20/4) and Copilot (2.60/4) (p=0.001 for all dimensions). ChatGPT excelled particularly in clinical application, procedure, and safety categories, consistently delivering accurate and comprehensive answers. No statistically significant differences were found between Gemini and Copilot in most categories.
ChatGPT demonstrated superior performance in generating accurate, complete, and clinically relevant responses for OAB management, highlighting its potential as a reliable tool for both healthcare professionals and patients. However, the variability observed in Gemini and Copilot underscores the need for further refinement of these models. Future studies should explore real-world integration of AI models into clinical workflows to enhance patient care and decision-making.