Comparison of artificial intelligence large language model chatbots in answering frequently asked questions in anaesthesia.

Author information

Nguyen Teresa P, Carvalho Brendan, Sukhdeo Hannah, Joudi Kareem, Guo Nan, Chen Marianne, Wolpaw Jed T, Kiefer Jesse J, Byrne Melissa, Jamroz Tatiana, Mootz Allison A, Reale Sharon C, Zou James, Sultan Pervez

Affiliations

Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA.

Department of Anesthesiology and Critical Care Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA.

Publication information

BJA Open. 2024 May 8;10:100280. doi: 10.1016/j.bjao.2024.100280. eCollection 2024 Jun.

Abstract

BACKGROUND

Patients are increasingly using artificial intelligence (AI) chatbots to seek answers to medical queries.

METHODS

Ten frequently asked questions in anaesthesia were posed to three AI chatbots: ChatGPT4 (OpenAI), Bard (Google), and Bing Chat (Microsoft). Each chatbot's answers were evaluated in a randomised, blinded order by five residency programme directors from 15 medical institutions in the USA. Three medical content quality categories (accuracy, comprehensiveness, safety) and three communication quality categories (understandability, empathy/respect, and ethics) were each scored from 1 to 5 (1 representing worst, 5 representing best).
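The randomised, blinded presentation lends itself to a small script. Below is a minimal sketch in Python of how one such rating packet could be assembled; the function names and data structures are illustrative assumptions, not the authors' actual study instruments.

```python
import random

CHATBOTS = ["ChatGPT4", "Bard", "Bing Chat"]
CATEGORIES = [
    "accuracy", "comprehensiveness", "safety",         # medical content quality
    "understandability", "empathy/respect", "ethics",  # communication quality
]

def build_blinded_packet(answers: dict[str, str], seed: int) -> list[tuple[str, str]]:
    """Shuffle one question's three answers and relabel them 'Answer A/B/C',
    so raters cannot tell which chatbot produced which answer."""
    rng = random.Random(seed)  # per-question seed -> a different order for each question
    items = list(answers.items())
    rng.shuffle(items)
    return [(f"Answer {chr(65 + i)}", text) for i, (_, text) in enumerate(items)]

def blank_score_sheet(labels: list[str]) -> dict[str, dict[str, int | None]]:
    """One 1-5 score per answer per category (1 = worst, 5 = best)."""
    return {label: {cat: None for cat in CATEGORIES} for label in labels}
```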

RESULTS

ChatGPT4 and Bard outperformed Bing Chat (median [inter-quartile range] scores: 4 [3-4], 4 [3-4], and 3 [2-4], respectively; P<0.001 with all metrics combined). All AI chatbots performed poorly in accuracy (score of ≥4 by 58%, 48%, and 36% of experts for ChatGPT4, Bard, and Bing Chat, respectively), comprehensiveness (score ≥4 by 42%, 30%, and 12% of experts for ChatGPT4, Bard, and Bing Chat, respectively), and safety (score ≥4 by 50%, 40%, and 28% of experts for ChatGPT4, Bard, and Bing Chat, respectively). Notably, answers from ChatGPT4, Bard, and Bing Chat differed statistically in comprehensiveness (ChatGPT4, 3 [2-4] vs Bing Chat, 2 [2-3], P<0.001; and Bard, 3 [2-4] vs Bing Chat, 2 [2-3], P=0.002). All large language model chatbots performed well with no statistical difference for understandability (P=0.24), empathy (P=0.032), and ethics (P=0.465).
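The abstract reports the ordinal scores as median [inter-quartile range] with pairwise P-values, but does not name its statistical tests in this excerpt. The sketch below shows one conventional analysis for 1-5 ordinal ratings (Kruskal-Wallis omnibus test with pairwise Mann-Whitney U comparisons, via SciPy); treat the choice of tests as an assumption, not a confirmed detail of the paper.

```python
import numpy as np
from scipy import stats

def median_iqr(scores: list[int]) -> str:
    """Format ratings as 'median [Q1-Q3]', the style used in the abstract."""
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return f"{med:g} [{q1:g}-{q3:g}]"

def compare_chatbots(ratings: dict[str, list[int]]) -> None:
    """Omnibus test across all chatbots, then pairwise comparisons."""
    names, groups = list(ratings), list(ratings.values())
    _, p_omnibus = stats.kruskal(*groups)
    print(f"Kruskal-Wallis (all chatbots): P = {p_omnibus:.3g}")
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            _, p = stats.mannwhitneyu(groups[i], groups[j])
            print(f"{names[i]} vs {names[j]}: P = {p:.3g}")
```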

CONCLUSIONS

In answering anaesthesia patients' frequently asked questions, the chatbots performed well on communication metrics but were suboptimal on medical content metrics. Overall, ChatGPT4 and Bard were comparable to each other, both outperforming Bing Chat.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a00/11099318/b36f570fd299/gr1.jpg
