Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware.

Author Information

Carlson Jonathan A, Cheng Robin Z, Lange Alyssa, Nagalakshmi Nadiminty, Rabets John, Shah Tariq, Sindhwani Puneet

Affiliations

Urology, The University of Toledo College of Medicine and Life Sciences, Toledo, USA.

Urology, The University of Toledo Medical Center, Toledo, USA.

Publication Information

Cureus. 2024 Aug 28;16(8):e67996. doi: 10.7759/cureus.67996. eCollection 2024 Aug.

Abstract

Purpose Artificial intelligence (AI) has rapidly gained popularity with the growth of ChatGPT (OpenAI, San Francisco, USA) and other large language model chatbots, and these programs have tremendous potential to impact medicine. One important consequence for medicine and public health is that patients may use these programs to seek answers to medical questions. Despite the public's increasing use of AI chatbots, there is little research assessing the reliability of ChatGPT and alternative programs when queried for medical information. This study seeks to elucidate the accuracy and readability of AI chatbot answers to patient questions regarding urology. Because vasectomy is one of the most common urologic procedures, this study investigates AI-generated responses to frequently asked vasectomy-related questions, using five popular, free-to-access AI platforms.

Methods Fifteen vasectomy-related questions were posed individually to five AI chatbots from November to December 2023: ChatGPT (OpenAI, San Francisco, USA), Bard (Google Inc., Mountain View, USA), Bing (Microsoft, Redmond, USA), Perplexity (Perplexity AI Inc., San Francisco, USA), and Claude (Anthropic, San Francisco, USA). Responses from each platform were graded by two attending urologists, two urology research faculty, and one urology resident physician on a 6-point Likert scale (1 = completely inaccurate, 6 = completely accurate), based on comparison with existing American Urological Association guidelines. The Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES, 1-100) were calculated for each response. Differences in Likert, FRES, and FKGL scores were assessed with Kruskal-Wallis tests in GraphPad Prism V10.1.0 (GraphPad, San Diego, USA) with alpha set at 0.05.

Results ChatGPT provided the most accurate responses of the five AI chatbots, with an average Likert score of 5.04, followed by Microsoft Bing (4.91), Anthropic Claude (4.65), Google Bard (4.43), and Perplexity (4.41). All five chatbots averaged at least 4.41, corresponding to a rating of at least "somewhat accurate." Google Bard received the highest Flesch Reading Ease Score (49.67) and the lowest grade level (10.1) of the chatbots. Anthropic Claude scored 46.7 on the FRES and 10.55 on the FKGL, Microsoft Bing scored 45.57 and 11.56, and Perplexity scored 36.4 and 13.29. ChatGPT had the lowest FRES (30.4) and the highest FKGL (14.2).

Conclusion This study examines the use of AI in medicine, specifically urology, and helps determine whether large language model chatbots can be reliable sources of freely available medical information. On average, all five AI chatbots achieved at least "somewhat accurate" on the 6-point Likert scale. In terms of readability, all five chatbots averaged Flesch Reading Ease scores below 50 and reading levels above the 10th grade. In this small-scale study, several significant differences were identified among the chatbots' readability scores, but no significant differences were found in their accuracy. Thus, our study suggests that the major AI chatbots may perform similarly in correctness but differ in how easily the general public can comprehend their responses.
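For readers unfamiliar with the measures named in the Methods section, the short sketch below illustrates how the two readability metrics and the Kruskal-Wallis comparison could be computed in Python. It is a minimal, hypothetical example rather than the authors' actual analysis (which was performed in GraphPad Prism); it assumes the third-party textstat and scipy packages are available, and the responses and likert_scores variables are made-up placeholders for the study data.

```python
# Illustrative sketch only (not the authors' analysis): computing readability
# metrics per response and comparing accuracy ratings across chatbots.
import textstat                  # provides Flesch Reading Ease and Flesch-Kincaid Grade Level
from scipy.stats import kruskal  # non-parametric test of differences across groups

# Hypothetical data: one answer text per chatbot, and pooled 1-6 Likert accuracy
# ratings for that chatbot's answers across questions and reviewers.
responses = {
    "ChatGPT": "A vasectomy is a minor surgical procedure that blocks the vas deferens...",
    "Bard": "A vasectomy is a simple operation intended as permanent birth control...",
}
likert_scores = {
    "ChatGPT": [5, 5, 6, 5, 4, 5, 6],
    "Bard": [4, 5, 4, 4, 5, 4, 4],
}

# Readability: higher FRES means easier reading; FKGL approximates a US school grade level.
for name, text in responses.items():
    fres = textstat.flesch_reading_ease(text)
    fkgl = textstat.flesch_kincaid_grade(text)
    print(f"{name}: FRES = {fres:.1f}, FKGL = {fkgl:.1f}")

# Accuracy: Kruskal-Wallis H-test across the chatbots' Likert ratings (alpha = 0.05).
h_stat, p_value = kruskal(*likert_scores.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```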

Article image: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/90d2/11427961/6e97d751e296/cureus-0016-00000067996-i01.jpg
