Cao Huawei, Hao Changzhen, Zhang Tao, Zheng Xiang, Gao Zihao, Wu Jiyue, Gan Lijian, Liu Yu, Zeng Xiangjun, Wang Wei
Department of Urology, Beijing Chao-yang Hospital, Capital Medical University, Beijing, China.
Department of Urology, Peking University International Hospital, Beijing, China.
Front Public Health. 2025 Jul 23;13:1605908. doi: 10.3389/fpubh.2025.1605908. eCollection 2025.
With the rapid advancement and widespread adoption of artificial intelligence (AI), patients increasingly turn to AI for initial medical guidance. Therefore, a comprehensive evaluation of AI-generated responses is warranted. This study aimed to compare the performance of DeepSeek and ChatGPT in answering urinary incontinence-related questions and to delineate their respective strengths and limitations.
Based on the American Urological Association/Society of Urodynamics, Female Pelvic Medicine & Urogenital Reconstruction (AUA/SUFU) and European Association of Urology (EAU) guidelines, we designed 25 urinary incontinence-related questions. Responses from DeepSeek and ChatGPT-4.0 were evaluated for reliability, quality, and readability. Fleiss' kappa was employed to calculate inter-rater reliability. For clinical case scenarios, we additionally assessed the appropriateness of responses. A comprehensive comparative analysis was performed.
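As a point of reference, Fleiss' kappa (the agreement statistic named above) has the standard textbook definition below for N items, m raters, and k rating categories, where n_ij is the number of raters assigning item i to category j; this is the general formula, not an excerpt from the article itself.

% Fleiss' kappa: chance-corrected agreement among m raters.
\begin{align}
  p_j &= \frac{1}{Nm}\sum_{i=1}^{N} n_{ij}, &
  P_i &= \frac{1}{m(m-1)}\Bigl(\sum_{j=1}^{k} n_{ij}^{2} - m\Bigr), \\
  \bar{P} &= \frac{1}{N}\sum_{i=1}^{N} P_i, &
  \bar{P}_e &= \sum_{j=1}^{k} p_j^{2}, \\
  \kappa &= \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}.
\end{align}

Values near 1 indicate near-perfect agreement among raters; values near 0 indicate agreement no better than chance.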
The modified DISCERN (mDISCERN) scores for DeepSeek and ChatGPT-4.0 were 28.24 ± 0.88 and 28.76 ± 1.56, respectively, showing no statistically significant difference [P = 0.188, Cohen's d = 0.41 (95% CI: -0.15, 0.97)]. Both AI chatbots rarely provided source references. In terms of quality, DeepSeek achieved a higher mean Global Quality Scale (GQS) score than ChatGPT-4.0 (4.76 ± 0.52 vs. 4.32 ± 0.69, P = 0.001). DeepSeek also demonstrated superior readability, with a higher Flesch Reading Ease (FRE) score (76.43 ± 10.90 vs. 70.95 ± 11.16, P = 0.039) and a lower Simple Measure of Gobbledygook (SMOG) index (12.26 ± 1.39 vs. 14.21 ± 1.88, P < 0.001), indicating easier comprehension. Regarding guideline adherence, DeepSeek had 11 of 15 (73.33%) fully compliant responses versus 13 of 15 (86.67%) for ChatGPT-4.0, a difference that was not statistically significant [P = 0.651, Cohen's d = 0.083 (95% CI: 0.021, 0.232)].
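For orientation, the readability and effect-size measures reported above follow standard published formulas, stated here for context rather than reproduced from the article:

% Flesch Reading Ease, SMOG index, and Cohen's d with pooled SD.
\begin{align}
  \mathrm{FRE} &= 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}, \\
  \mathrm{SMOG} &= 1.0430\sqrt{30 \times \frac{\text{polysyllabic words}}{\text{total sentences}}} + 3.1291, \\
  d &= \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad
  s_p = \sqrt{\frac{(n_1 - 1)s_1^{2} + (n_2 - 1)s_2^{2}}{n_1 + n_2 - 2}}.
\end{align}

Higher FRE and lower SMOG both indicate easier text, consistent with the direction of the comparisons above. As a quick check, assuming equal group sizes, the pooled SD of the mDISCERN scores is sqrt((0.88^2 + 1.56^2)/2) ≈ 1.27, giving d ≈ 0.52/1.27 ≈ 0.41, which matches the reported effect size.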
DeepSeek and ChatGPT-4.0 might exhibit comparable reliability in answering urinary incontinence-related questions, though both lacked sufficient references. However, DeepSeek outperformed ChatGPT-4.0 in response quality and readability. While both AI chatbots largely adhered to clinical guidelines, occasional deviations were observed. Further refinements are necessary before the widespread clinical implementation of AI chatbots in urology.