Baturu Muharrem, Solakhan Mehmet, Kazaz Tanyeli Guneyligil, Bayrak Omer
Department of Urology, University of Gaziantep, Gaziantep, Turkey.
Department of Urology, Hasan Kalyoncu University, Gaziantep, Turkey.
Int J Impot Res. 2025 Apr;37(4):310-314. doi: 10.1038/s41443-024-00898-3. Epub 2024 May 7.
The present study assessed the accuracy of artificial intelligence-generated responses to frequently asked questions on erectile dysfunction. A cross-sectional analysis involved 56 erectile dysfunction-related questions searched on Google, categorized into nine sections: causes, diagnosis, treatment options, treatment complications, protective measures, relationship with other illnesses, treatment costs, treatment with herbal agents, and appointments. Responses from ChatGPT 3.5, ChatGPT 4, and BARD were evaluated by two experienced urology experts for accuracy, relevance, and comprehensibility using the F1 score and the global quality score (GQS). ChatGPT 3.5 and ChatGPT 4 achieved higher GQS than BARD in the categories of causes (4.5 ± 0.54, 4.5 ± 0.51, 3.15 ± 1.01, respectively, p < 0.001), treatment options (4.35 ± 0.6, 4.5 ± 0.43, 2.71 ± 1.38, respectively, p < 0.001), protective measures (5.0 ± 0, 5.0 ± 0, 4 ± 0.5, respectively, p = 0.013), relationship with other illnesses (4.58 ± 0.58, 4.83 ± 0.25, 3.58 ± 0.8, respectively, p = 0.006), and treatment with herbal agents (3 ± 0.61, 3.33 ± 0.83, 1.8 ± 1.09, respectively, p = 0.043). F1 scores in the causes (1), diagnosis (0.857), treatment options (0.726), and protective measures (1) categories indicated alignment with the guidelines. There was no significant difference between ChatGPT 3.5 and ChatGPT 4 in answer quality, but both outperformed BARD on the GQS. These results emphasize the need to continually enhance and validate AI-generated medical information and underscore the importance of artificial intelligence systems in delivering reliable information on erectile dysfunction.
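For readers unfamiliar with the metric, the F1 score used above is the harmonic mean of precision and recall. The minimal Python sketch below illustrates the calculation; the counts are hypothetical, chosen only to reproduce the 0.857 value reported for the diagnosis category, since the abstract does not specify how concordant and discordant answers were tallied.

    # Minimal sketch of the F1 calculation (hypothetical counts; the abstract
    # does not report how true/false positives were counted per category).
    def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
        """Return the F1 score, the harmonic mean of precision and recall."""
        precision = true_positives / (true_positives + false_positives)
        recall = true_positives / (true_positives + false_negatives)
        return 2 * precision * recall / (precision + recall)

    # Example: 6 guideline-concordant answers, 1 discordant answer, and 1
    # guideline point missed gives F1 = 0.857, the value reported for the
    # diagnosis category.
    print(round(f1_score(6, 1, 1), 3))  # 0.857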