Evaluating the Readability and Quality of Bladder Cancer Information from AI Chatbots: A Comparative Study Between ChatGPT, Google Gemini, Grok, Claude and DeepSeek.

Authors

Patel Kunjan, Radcliffe Robert

Affiliations

Royal Derby Hospital, Derby DE22 3NE, UK.

Publication information

J Clin Med. 2025 Nov 3;14(21):7804. doi: 10.3390/jcm14217804.

DOI:10.3390/jcm14217804

PMID:41227199

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12610445/

Abstract

Background/Objectives: Artificial intelligence (AI)-based chatbots such as ChatGPT are readily available and are quickly becoming a source of information for patients, in place of traditional Google searches. We assessed the quality of information on bladder cancer provided by five AI chatbots: ChatGPT 4o, Google Gemini 2.0 Flash, Grok 3, Claude 3.7 Sonnet and DeepSeek R1. Their responses were analysed using readability indices, and two consultant urologists rated the quality of the information using the validated DISCERN tool.

Methods: The 10 most frequently asked questions about bladder cancer were identified using Google Trends. These questions were then put to the five AI chatbots, and their responses were collected. No additional prompts were used, reflecting the natural-language queries that patients would use. The readability of the responses was analysed using five validated indices: Flesch Reading Ease (FRE), the Flesch-Kincaid Reading Grade Level (FKRGL), the Gunning Fog Index, the Coleman-Liau Index and the SMOG index. Two consultant urologists then independently assessed the responses of each chatbot using the DISCERN tool, which rates the quality of health information on a five-point Likert scale. Inter-rater agreement was calculated using Cohen's Kappa and the intraclass correlation coefficient (ICC).

Results: ChatGPT 4o had the best overall readability, with the highest Flesch Reading Ease score (59.4) and the lowest average reading grade level (7.0) required to understand the material. Grok 3 was a close second (FRE 58.3, grade level 8.7). Claude 3.7 Sonnet used the most complex language in its answers and therefore had the lowest FRE score (44.9), the highest grade level (9.5) and the highest complexity on the other indices. In the DISCERN analysis, Grok 3 received the highest average score (52.0), followed closely by ChatGPT 4o (50.5). Inter-rater agreement was highest for ChatGPT 4o (ICC 0.791, Kappa 0.437) and lowest for Grok 3 (ICC 0.339, Kappa 0.0, weighted Kappa 0.335).

Conclusions: All five AI chatbots provided generally good-quality answers to questions about bladder cancer, with no hallucinations identified. ChatGPT 4o performed best overall, with the best readability metrics, strong DISCERN ratings and the highest inter-rater agreement.
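As a rough illustration of two of the metrics named in the abstract, the sketch below computes Flesch Reading Ease, the Flesch-Kincaid grade level and unweighted Cohen's Kappa using their standard published formulas. It is not the authors' tooling: the syllable counter is a crude vowel-group heuristic, the function names are hypothetical, and dedicated readability calculators will give somewhat different values.

```python
import re
from collections import Counter


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups (heuristic, an assumption)."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in endings like '-le' or '-ee'.
    if word.endswith("e") and not word.endswith(("le", "ee")) and n > 1:
        n -= 1
    return max(n, 1)


def text_stats(text: str) -> tuple[int, int, int]:
    """Return (sentence count, word count, syllable count) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(sentences), len(words), syllables


def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    s, w, syl = text_stats(text)
    return 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)


def flesch_kincaid_grade(text: str) -> float:
    """FKGL = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59."""
    s, w, syl = text_stats(text)
    return 0.39 * (w / s) + 11.8 * (syl / w) - 15.59


def cohen_kappa(rater1: list[int], rater2: list[int]) -> float:
    """Unweighted Cohen's Kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater1)
    # Observed agreement: share of items given the same rating by both raters.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal rating frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    sample = "Bladder cancer often causes blood in the urine. See a doctor soon."
    print(round(flesch_reading_ease(sample), 1))
    print(round(flesch_kincaid_grade(sample), 1))
    # Hypothetical five-point ratings from two raters, for illustration only.
    print(round(cohen_kappa([3, 4, 4, 5, 2], [3, 4, 5, 5, 2]), 3))
```

Unweighted Kappa counts only exact matches, so two raters who routinely differ by one point on a five-point scale can score near zero even when their ratings track each other; a weighted Kappa or the ICC gives credit for near misses, which is consistent with the abstract reporting a Kappa of 0.0 alongside a weighted Kappa of 0.335 for Grok 3.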

Figure

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/109d/12610445/0d482e285a27/jcm-14-07804-g001.jpg

Similar articles

Comparative Evaluation of Responses from ChatGPT-5, Gemini 2.5 Flash, Grok 4, and Claude Sonnet-4 Chatbots to Questions About Endodontic Iatrogenic Events.

Healthcare (Basel). 2025 Oct 17;13(20):2615. doi: 10.3390/healthcare13202615.

Evaluating the Quality and Readability of Generative Artificial Intelligence (AI) Chatbot Responses in the Management of Achilles Tendon Rupture.

Cureus. 2025 Jan 31;17(1):e78313. doi: 10.7759/cureus.78313. eCollection 2025 Jan.

AI-generated patient education for ankylosing spondylitis: a comparative study of readability and quality.

Clin Rheumatol. 2026 Mar;45(3):2003-2008. doi: 10.1007/s10067-025-07771-8. Epub 2025 Dec 13.

Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware.

Cureus. 2024 Aug 28;16(8):e67996. doi: 10.7759/cureus.67996. eCollection 2024 Aug.

Assessing the Readability of Patient Education Materials on Cardiac Catheterization From Artificial Intelligence Chatbots: An Observational Cross-Sectional Study.

Cureus. 2024 Jul 4;16(7):e63865. doi: 10.7759/cureus.63865. eCollection 2024 Jul.

Chatbots in urology: accuracy, calibration, and comprehensibility; is DeepSeek taking over the throne?

BJU Int. 2025 Jul 31. doi: 10.1111/bju.16873.

Balancing Accuracy and Readability: Comparative Evaluation of AI Chatbots for Patient Education on Rotator Cuff Tears.

Healthcare (Basel). 2025 Oct 23;13(21):2670. doi: 10.3390/healthcare13212670.

Evaluation of the artificial intelligence chatbots in frequently asked questions about retinitis pigmentosa: a comparative analysis between ChatGPT-4 and Gemini-2.0.

Int J Retina Vitreous. 2025 Nov 28;12(1):1. doi: 10.1186/s40942-025-00772-4.

Assessing the quality and readability of patient education materials on chemotherapy cardiotoxicity from artificial intelligence chatbots: An observational cross-sectional study.

Medicine (Baltimore). 2025 Apr 11;104(15):e42135. doi: 10.1097/MD.0000000000042135.
