
Evaluation of correctness and reliability of GPT, Bard, and Bing chatbots' responses in basic life support scenarios.

Author Information

Aqavil-Jahromi Saeed, Eftekhari Mohammad, Akbari Hamideh, Aligholi-Zahraie Mehrnoosh

Affiliations

Department of Emergency Medicine, Imam Khomeini Hospital Complex, Tehran University of Medical Sciences, Tehran, Iran.

Prehospital and Hospital Emergency Research Center, Tehran University of Medical Sciences, Tehran, Iran.

Publication Information

Sci Rep. 2025 Apr 3;15(1):11429. doi: 10.1038/s41598-024-82948-w.

Abstract

Timely recognition and initiation of basic life support (BLS) before emergency medical services arrive significantly improves survival rates and neurological outcomes. In an era where health information-seeking behaviors have shifted toward online sources, chatbots powered by generative artificial intelligence (AI) are emerging as potential tools for providing immediate health-related guidance. This study investigates the reliability of AI chatbots, specifically GPT-3.5, GPT-4, Bard, and Bing, in responding to BLS scenarios. A cross-sectional study was conducted using six scenarios adapted from the BLS Objective Structured Clinical Examination (OSCE) by United Medical Education. These scenarios, covering adult, pediatric, and infant emergencies, were presented to each chatbot on two occasions, one week apart. Responses were evaluated by a board-certified emergency medicine professor from Tehran University of Medical Sciences, using a checklist based on BLS-OSCE standards. Correctness was assessed, and reliability was measured using Cohen's kappa coefficient. GPT-4 demonstrated the highest correctness in adult scenarios (85% correct responses), while Bard showed 60% correctness. GPT-3.5 and Bing performed poorly across all scenarios. Bard achieved a correctness rate of 52.17% in pediatric scenarios, but all chatbots scored below 44% in infant scenarios. Cohen's kappa indicated substantial reliability for GPT-4 (k = 0.649) and GPT-3.5 (k = 0.645), moderate reliability for Bing (k = 0.503), and fair reliability for Bard (k = 0.357). While GPT-4 showed the highest correctness and reliability in adult BLS situations, all tested chatbots struggled significantly in pediatric and infant cases. Furthermore, none of the chatbots consistently adhered to BLS guidelines, raising concerns about their potential use in real-life emergencies.
Based on these findings, AI chatbots in their current form can only be relied upon to guide bystanders through life-saving procedures with human supervision.



