Altıntaş Emre, Ozkent Mehmet Serkan, Gül Murat, Batur Ali Furkan, Kaynar Mehmet, Kılıç Özcan, Göktaş Serdar
Selcuk University, Faculty of Medicine, Department of Urology, Konya, Turkey.
Konya City Hospital, Department of Urology, Konya, Turkey.
Fr J Urol. 2024 Jul;34(7-8):102666. doi: 10.1016/j.fjurol.2024.102666. Epub 2024 Jun 5.
Artificial intelligence (AI) applications are increasingly being utilized by both patients and physicians for accessing medical information. This study focused on the urolithiasis section (pertaining to kidney and ureteral stones) of the European Association of Urology (EAU) guideline, a key reference for urologists.
We directed inquiries to four distinct AI chatbots to assess their responses in relation to guideline adherence. A total of 115 recommendations were transformed into questions, and responses were evaluated by two urologists with a minimum of 5 years of experience using a 5-point Likert scale (1 - False, 2 - Inadequate, 3 - Sufficient, 4 - Correct, and 5 - Very correct).
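A minimal sketch of the scoring setup described above is given below; the chatbot names follow the study, but the score values and variable names are illustrative placeholders, not the study's dataset.

```python
# Illustrative sketch: 1-5 Likert scores (1 = False ... 5 = Very correct),
# one entry per guideline-derived question, averaged across the two raters.
from statistics import mean, stdev

scores = {
    "ChatGPT 4.0": [5, 5, 4, 5, 5],   # placeholder values, not study data
    "Perplexity":  [5, 4, 5, 5, 4],
    "Bing":        [4, 5, 3, 4, 5],
    "Bard":        [3, 4, 2, 5, 3],
}

for chatbot, likert in scores.items():
    print(f"{chatbot}: mean={mean(likert):.2f}, SD={stdev(likert):.2f}")
```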
The mean scores for Perplexity and ChatGPT 4.0 were 4.68 (SD: 0.80) and 4.80 (SD: 0.47), respectively, both differing significantly from the scores of Bing and Bard (Bing vs. Perplexity, P<0.001; Bard vs. Perplexity, P<0.001; Bing vs. ChatGPT, P<0.001; Bard vs. ChatGPT, P<0.001). Bing had a mean score of 4.21 (SD: 0.96), while Bard scored 3.56 (SD: 1.14), a significant difference (Bing vs. Bard, P<0.001). Bard had the lowest score among all chatbots. Analysis of references revealed that Perplexity and Bing cited the guideline most frequently (47.3% and 30%, respectively).
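The abstract does not name the statistical test behind these pairwise P values; the sketch below assumes a Mann-Whitney U test, one common choice for ordinal Likert data, and again uses placeholder score arrays.

```python
# Hedged sketch of pairwise chatbot comparisons on Likert scores.
# Test choice (Mann-Whitney U) is an assumption; data are placeholders.
from itertools import combinations
from scipy.stats import mannwhitneyu

scores = {
    "ChatGPT 4.0": [5, 5, 4, 5, 5, 5],
    "Perplexity":  [5, 4, 5, 5, 4, 5],
    "Bing":        [4, 5, 3, 4, 5, 4],
    "Bard":        [3, 4, 2, 5, 3, 3],
}

# Compare every pair of chatbots and report the two-sided P value.
for a, b in combinations(scores, 2):
    stat, p = mannwhitneyu(scores[a], scores[b], alternative="two-sided")
    print(f"{a} vs. {b}: U={stat:.1f}, P={p:.3f}")
```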
Our findings demonstrate that ChatGPT 4.0 and, notably, Perplexity align well with EAU guideline recommendations. These continuously evolving applications may play a crucial role in delivering information to physicians in the future, especially for urolithiasis.