Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware.

Author Information

Carlson Jonathan A, Cheng Robin Z, Lange Alyssa, Nagalakshmi Nadiminty, Rabets John, Shah Tariq, Sindhwani Puneet

Affiliations

Urology, The University of Toledo College of Medicine and Life Sciences, Toledo, USA.

Urology, The University of Toledo Medical Center, Toledo, USA.

Publication Information

Cureus. 2024 Aug 28;16(8):e67996. doi: 10.7759/cureus.67996. eCollection 2024 Aug.

Abstract

Purpose Artificial intelligence (AI) has rapidly gained popularity with the growth of ChatGPT (OpenAI, San Francisco, USA) and other large language model chatbots, and these programs have tremendous potential to impact medicine. One important consequence for medicine and public health is that patients may use these programs to seek answers to medical questions. Despite the public's increasing use of AI chatbots, there is little research assessing the reliability of ChatGPT and alternative programs when queried for medical information. This study seeks to elucidate the accuracy and readability of AI chatbot answers to patient questions regarding urology. Because vasectomy is one of the most common urologic procedures, this study investigates AI-generated responses to frequently asked vasectomy-related questions, using five popular, free-to-access AI platforms.

Methods Fifteen vasectomy-related questions were posed individually to five AI chatbots from November to December 2023: ChatGPT (OpenAI, San Francisco, USA), Bard (Google Inc., Mountain View, USA), Bing (Microsoft, Redmond, USA), Perplexity (Perplexity AI Inc., San Francisco, USA), and Claude (Anthropic, San Francisco, USA). Responses from each platform were graded by two attending urologists, two urology research faculty, and one urology resident physician on a 6-point Likert scale (1 = completely inaccurate, 6 = completely accurate), based on comparison with existing American Urological Association guidelines. The Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES, 1-100) were calculated for each response. Differences in Likert, FRES, and FKGL scores were assessed with Kruskal-Wallis tests in GraphPad Prism V10.1.0 (GraphPad, San Diego, USA) with alpha set at 0.05.

Results ChatGPT provided the most accurate responses of the five AI chatbots, with an average Likert score of 5.04, followed by Microsoft Bing (4.91), Anthropic Claude (4.65), Google Bard (4.43), and Perplexity (4.41). All five chatbots averaged at least 4.41, corresponding to a rating of at least "somewhat accurate." Google Bard received the highest Flesch Reading Ease Score (49.67) and the lowest grade level (10.1) of the chatbots. Anthropic Claude scored 46.7 on the FRES and 10.55 on the FKGL, Microsoft Bing scored 45.57 and 11.56, and Perplexity scored 36.4 and 13.29. ChatGPT had the lowest FRES (30.4) and the highest FKGL (14.2).

Conclusion This study examines the use of AI in medicine, specifically urology, and helps determine whether large language model chatbots can be reliable sources of freely available medical information. On average, all five AI chatbots achieved at least "somewhat accurate" on the 6-point Likert scale. In terms of readability, all five chatbots averaged Flesch Reading Ease scores below 50 and reading levels above the 10th grade. In this small-scale study, several significant differences were identified among the chatbots' readability scores, but no significant differences were found in their accuracy. Thus, our study suggests that the major AI chatbots may perform similarly in correctness but differ in how easily the general public can comprehend their responses.
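For readers unfamiliar with the measures named in the Methods section, the short sketch below illustrates how the two readability metrics and the Kruskal-Wallis comparison could be computed in Python. It is a minimal, hypothetical example rather than the authors' actual analysis (which was performed in GraphPad Prism); it assumes the third-party textstat and scipy packages are available, and the responses and likert_scores variables are made-up placeholders for the study data.

```python
# Illustrative sketch only (not the authors' analysis): computing readability
# metrics per response and comparing accuracy ratings across chatbots.
import textstat                  # provides Flesch Reading Ease and Flesch-Kincaid Grade Level
from scipy.stats import kruskal  # non-parametric test of differences across groups

# Hypothetical data: one answer text per chatbot, and pooled 1-6 Likert accuracy
# ratings for that chatbot's answers across questions and reviewers.
responses = {
    "ChatGPT": "A vasectomy is a minor surgical procedure that blocks the vas deferens...",
    "Bard": "A vasectomy is a simple operation intended as permanent birth control...",
}
likert_scores = {
    "ChatGPT": [5, 5, 6, 5, 4, 5, 6],
    "Bard": [4, 5, 4, 4, 5, 4, 4],
}

# Readability: higher FRES means easier reading; FKGL approximates a US school grade level.
for name, text in responses.items():
    fres = textstat.flesch_reading_ease(text)
    fkgl = textstat.flesch_kincaid_grade(text)
    print(f"{name}: FRES = {fres:.1f}, FKGL = {fkgl:.1f}")

# Accuracy: Kruskal-Wallis H-test across the chatbots' Likert ratings (alpha = 0.05).
h_stat, p_value = kruskal(*likert_scores.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```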

Article image: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/90d2/11427961/6e97d751e296/cureus-0016-00000067996-i01.jpg
