Department of Urology, Bagcilar Training and Research Hospital, University of Health Sciences, Istanbul, Turkey.
Department of Urology, Faculty of Medicine, Istinye University, Istanbul, Turkey.
World J Urol. 2024 Mar 14;42(1):158. doi: 10.1007/s00345-024-04847-z.
The study aimed to assess the efficacy of OpenAI's advanced AI model, ChatGPT, in diagnosing urological conditions, focusing on kidney stones.
A set of 90 structured questions, consistent with the 2023 EAU Guidelines, was curated by experienced urologists for this investigation. We evaluated ChatGPT's performance based on the accuracy and completeness of its responses to two types of questions [binary (true/false) and descriptive (multiple-choice)], stratified into three difficulty levels: easy, moderate, and complex. Furthermore, we analyzed the model's capacity to learn and adapt by reassessing the initially incorrect responses after a 2-week interval.
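The abstract does not specify how responses were recorded or graded; purely as an illustrative sketch, the snippet below shows one way the question set and reviewer scores could be represented, assuming hypothetical 1-3 Likert-style accuracy and completeness scores (the field names and the scale are assumptions, not details taken from the study).

```python
from dataclasses import dataclass
from enum import Enum

class Difficulty(Enum):
    EASY = "easy"
    MODERATE = "moderate"
    COMPLEX = "complex"

class QuestionType(Enum):
    BINARY = "binary"            # true/false
    DESCRIPTIVE = "descriptive"  # multiple-choice

@dataclass
class GradedResponse:
    question: str
    qtype: QuestionType
    difficulty: Difficulty
    correct: bool            # reviewer judgement of correctness
    accuracy_score: int      # assumed 1-3 Likert score
    completeness_score: int  # assumed 1-3 Likert score

def accuracy_rate(responses, qtype):
    """Proportion of correctly answered questions of a given type."""
    subset = [r for r in responses if r.qtype is qtype]
    return sum(r.correct for r in subset) / len(subset)
```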
The model demonstrated commendable accuracy, correctly answering 80% of the binary questions (n = 45) and 93.3% of the descriptive questions (n = 45). Its performance showed no significant variation across question difficulty levels (p = 0.548 for accuracy and p = 0.417 for completeness). Upon reassessment of the 12 initially incorrect responses (9 binary and 3 descriptive) after two weeks, ChatGPT's accuracy improved substantially: the mean accuracy score rose from 1.58 ± 0.51 to 2.83 ± 0.93 (p = 0.004), underscoring the model's ability to learn and adapt over time.
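The abstract reports p-values but not the underlying statistical tests; the sketch below is one plausible reconstruction, assuming a chi-square test of correctness across the three difficulty levels and a Wilcoxon signed-rank test for the paired before/after accuracy scores of the 12 reattempted questions. All numbers in the arrays are placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical correct/incorrect counts per difficulty level
# (easy, moderate, complex) -- placeholder values, not study data.
counts = np.array([
    [28, 2],
    [26, 4],
    [24, 6],
])
chi2, p_difficulty, dof, _ = stats.chi2_contingency(counts)
print(f"accuracy vs. difficulty: chi2={chi2:.2f}, p={p_difficulty:.3f}")

# Hypothetical paired 1-3 accuracy scores for the 12 initially
# incorrect answers, before and after the 2-week reassessment
# (the study's actual scoring scale is not stated in the abstract).
before = np.array([1, 2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2])
after  = np.array([3, 3, 2, 3, 3, 2, 3, 1, 3, 3, 3, 3])
stat, p_paired = stats.wilcoxon(before, after)
print(f"before vs. after reassessment: W={stat:.1f}, p={p_paired:.4f}")
```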
These findings highlight the potential of ChatGPT in urological diagnostics but also underscore areas requiring enhancement, especially the completeness of responses to complex queries. The study endorses the incorporation of AI into healthcare while advocating prudence and professional supervision in its application.