

Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study.

Affiliations

College of Natural and Agricultural Sciences, University of California - Riverside, Riverside, CA, United States.

Department of Emergency Medicine, University of California - Irvine, Orange, CA, United States.

Publication information

J Med Internet Res. 2024 Nov 4;26:e60291. doi: 10.2196/60291.

Abstract

BACKGROUND

Recent surveys indicate that 48% of consumers actively use generative artificial intelligence (AI) for health-related inquiries. Despite widespread adoption and the potential to improve health care access, scant research examines the performance of AI chatbot responses regarding emergency care advice.

OBJECTIVE

We assessed the quality of AI chatbot responses to common emergency care questions. We sought to determine qualitative differences in responses from 4 free-access AI chatbots, for 10 different serious and benign emergency conditions.

METHODS

We created 10 emergency care questions that we fed into the free-access versions of ChatGPT 3.5 (OpenAI), Google Bard, Bing AI Chat (Microsoft), and Claude AI (Anthropic) on November 26, 2023. Each response was graded by 5 board-certified emergency medicine (EM) faculty for 8 domains of percentage accuracy, presence of dangerous information, factual accuracy, clarity, completeness, understandability, source reliability, and source relevancy. We determined the correct, complete response to the 10 questions from reputable and scholarly emergency medical references. These were compiled by an EM resident physician. For the readability of the chatbot responses, we used the Flesch-Kincaid Grade Level of each response from readability statistics embedded in Microsoft Word. Differences between chatbots were determined by the chi-square test.
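The following is a minimal sketch, not the authors' code, of the two quantitative steps named above: it assumes the standard Flesch-Kincaid Grade Level formula (the same definition used by Microsoft Word's readability statistics) and a chi-square test of independence computed with SciPy. All counts in the example are illustrative placeholders, not study data.

    # Sketch only: standard Flesch-Kincaid Grade Level formula and a chi-square
    # test of independence, approximating the readability and between-chatbot
    # comparisons described in the Methods. All counts below are placeholders.
    from scipy.stats import chi2_contingency

    def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
        # FKGL = 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
        return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

    # Hypothetical response: 180 words, 12 sentences, 270 syllables -> about grade 8.0
    print(round(flesch_kincaid_grade(180, 12, 270), 1))

    # Chi-square test across chatbots for one binary domain (eg, dangerous
    # information present). Rows = chatbots, columns = [present, absent];
    # the counts are hypothetical.
    table = [
        [3, 47],   # chatbot 1
        [10, 40],  # chatbot 2
        [5, 45],   # chatbot 3
        [7, 43],   # chatbot 4
    ]
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"chi2={chi2:.2f}, dof={dof}, P={p:.2f}")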

RESULTS

Each of the 4 chatbots' responses to the 10 clinical questions was scored across 8 domains by 5 EM faculty, yielding 400 assessments per chatbot. Together, the 4 chatbots had the best performance in clarity and understandability (both 85%), intermediate performance in accuracy and completeness (both 50%), and poor performance (10%) for source relevance and reliability (mostly unreported). Chatbots contained dangerous information in 5% to 35% of responses, with no statistical difference between chatbots on this metric (P=.24). ChatGPT, Google Bard, and Claude AI had similar performance across 6 of 8 domains. Only Bing AI performed better, with more identified or relevant sources (40%; the others had 0%-10%). The Flesch-Kincaid reading level was grade 7.7 to 8.9 for all chatbots except ChatGPT, at grade 10.8; all were too advanced for average emergency patients. Responses included both dangerous advice (eg, starting cardiopulmonary resuscitation with no pulse check) and generally inappropriate advice (eg, loosening the collar to improve breathing without evidence of airway compromise).
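For clarity, the assessment count reported above follows directly from the study design; the short calculation below restates that arithmetic and adds no new data.

    # 10 questions x 8 domains x 5 faculty raters = 400 assessments per chatbot;
    # 4 chatbots gives 1,600 assessments overall.
    questions, domains, raters, chatbots = 10, 8, 5, 4
    per_chatbot = questions * domains * raters   # 400
    total = per_chatbot * chatbots               # 1600
    print(per_chatbot, total)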

CONCLUSIONS

AI chatbots, though ubiquitous, have significant deficiencies in EM patient advice, despite relatively consistent performance. Information for when to seek urgent or emergent care is frequently incomplete and inaccurate, and patients may be unaware of misinformation. Sources are not generally provided. Patients who use AI to guide health care decisions assume potential risks. AI chatbots for health should be subject to further research, refinement, and regulation. We strongly recommend proper medical consultation to prevent potential adverse outcomes.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b49b/11574488/ebe95aa10822/jmir_v26i1e60291_fig1.jpg
