Comparison of artificial intelligence large language model chatbots in answering frequently asked questions in anaesthesia.

Author information

Nguyen Teresa P, Carvalho Brendan, Sukhdeo Hannah, Joudi Kareem, Guo Nan, Chen Marianne, Wolpaw Jed T, Kiefer Jesse J, Byrne Melissa, Jamroz Tatiana, Mootz Allison A, Reale Sharon C, Zou James, Sultan Pervez

Affiliations

Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA.

Department of Anesthesiology and Critical Care Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA.

Publication information

BJA Open. 2024 May 8;10:100280. doi: 10.1016/j.bjao.2024.100280. eCollection 2024 Jun.

Abstract

BACKGROUND

Patients are increasingly using artificial intelligence (AI) chatbots to seek answers to medical queries.

METHODS

Ten frequently asked questions in anaesthesia were posed to three AI chatbots: ChatGPT4 (OpenAI), Bard (Google), and Bing Chat (Microsoft). Each chatbot's answers were evaluated in a randomised, blinded order by five residency programme directors from 15 medical institutions in the USA. Three medical content quality categories (accuracy, comprehensiveness, safety) and three communication quality categories (understandability, empathy/respect, and ethics) were each scored from 1 to 5 (1 representing worst, 5 representing best).
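The randomised, blinded presentation lends itself to a small script. Below is a minimal sketch in Python of how one such rating packet could be assembled; the function names and data structures are illustrative assumptions, not the authors' actual study instruments.

```python
import random

CHATBOTS = ["ChatGPT4", "Bard", "Bing Chat"]
CATEGORIES = [
    "accuracy", "comprehensiveness", "safety",         # medical content quality
    "understandability", "empathy/respect", "ethics",  # communication quality
]

def build_blinded_packet(answers: dict[str, str], seed: int) -> list[tuple[str, str]]:
    """Shuffle one question's three answers and relabel them 'Answer A/B/C',
    so raters cannot tell which chatbot produced which answer."""
    rng = random.Random(seed)  # per-question seed -> a different order for each question
    items = list(answers.items())
    rng.shuffle(items)
    return [(f"Answer {chr(65 + i)}", text) for i, (_, text) in enumerate(items)]

def blank_score_sheet(labels: list[str]) -> dict[str, dict[str, int | None]]:
    """One 1-5 score per answer per category (1 = worst, 5 = best)."""
    return {label: {cat: None for cat in CATEGORIES} for label in labels}
```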

RESULTS

ChatGPT4 and Bard outperformed Bing Chat (median [inter-quartile range] scores: 4 [3-4], 4 [3-4], and 3 [2-4], respectively; P<0.001 with all metrics combined). All AI chatbots performed poorly in accuracy (score of ≥4 by 58%, 48%, and 36% of experts for ChatGPT4, Bard, and Bing Chat, respectively), comprehensiveness (score ≥4 by 42%, 30%, and 12% of experts for ChatGPT4, Bard, and Bing Chat, respectively), and safety (score ≥4 by 50%, 40%, and 28% of experts for ChatGPT4, Bard, and Bing Chat, respectively). Notably, answers from ChatGPT4, Bard, and Bing Chat differed statistically in comprehensiveness (ChatGPT4, 3 [2-4] vs Bing Chat, 2 [2-3], P<0.001; and Bard, 3 [2-4] vs Bing Chat, 2 [2-3], P=0.002). All large language model chatbots performed well with no statistical difference for understandability (P=0.24), empathy (P=0.032), and ethics (P=0.465).
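The abstract reports the ordinal scores as median [inter-quartile range] with pairwise P-values, but does not name its statistical tests in this excerpt. The sketch below shows one conventional analysis for 1-5 ordinal ratings (Kruskal-Wallis omnibus test with pairwise Mann-Whitney U comparisons, via SciPy); treat the choice of tests as an assumption, not a confirmed detail of the paper.

```python
import numpy as np
from scipy import stats

def median_iqr(scores: list[int]) -> str:
    """Format ratings as 'median [Q1-Q3]', the style used in the abstract."""
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return f"{med:g} [{q1:g}-{q3:g}]"

def compare_chatbots(ratings: dict[str, list[int]]) -> None:
    """Omnibus test across all chatbots, then pairwise comparisons."""
    names, groups = list(ratings), list(ratings.values())
    _, p_omnibus = stats.kruskal(*groups)
    print(f"Kruskal-Wallis (all chatbots): P = {p_omnibus:.3g}")
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            _, p = stats.mannwhitneyu(groups[i], groups[j])
            print(f"{names[i]} vs {names[j]}: P = {p:.3g}")
```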

CONCLUSIONS

In answering anaesthesia patients' frequently asked questions, the chatbots performed well on communication metrics but were suboptimal on medical content metrics. Overall, ChatGPT4 and Bard were comparable to each other, both outperforming Bing Chat.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4a00/11099318/b36f570fd299/gr1.jpg
