
Evaluation of correctness and reliability of GPT, Bard, and Bing chatbots' responses in basic life support scenarios.

Author Information

Aqavil-Jahromi Saeed, Eftekhari Mohammad, Akbari Hamideh, Aligholi-Zahraie Mehrnoosh

Affiliations

Department of Emergency Medicine, Imam Khomeini Hospital Complex, Tehran University of Medical Sciences, Tehran, Iran.

Prehospital and Hospital Emergency Research Center, Tehran University of Medical Sciences, Tehran, Iran.

Publication Information

Sci Rep. 2025 Apr 3;15(1):11429. doi: 10.1038/s41598-024-82948-w.

Abstract

Timely recognition and initiation of basic life support (BLS) before emergency medical services arrive significantly improves survival rates and neurological outcomes. In an era where health information-seeking behaviors have shifted toward online sources, chatbots powered by generative artificial intelligence (AI) are emerging as potential tools for providing immediate health-related guidance. This study investigates the reliability of AI chatbots, specifically GPT-3.5, GPT-4, Bard, and Bing, in responding to BLS scenarios. A cross-sectional study was conducted using six scenarios adapted from the BLS Objective Structured Clinical Examination (OSCE) by United Medical Education. These scenarios, covering adult, pediatric, and infant emergencies, were presented to each chatbot on two occasions, one week apart. Responses were evaluated by a board-certified emergency medicine professor from Tehran University of Medical Sciences, using a checklist based on BLS-OSCE standards. Correctness was assessed, and reliability was measured using Cohen's kappa coefficient. GPT-4 demonstrated the highest correctness in adult scenarios (85% correct responses), while Bard showed 60% correctness. GPT-3.5 and Bing performed poorly across all scenarios. Bard achieved a correctness rate of 52.17% in pediatric scenarios, but all chatbots scored below 44% in infant scenarios. Cohen's kappa indicated substantial reliability for GPT-4 (k = 0.649) and GPT-3.5 (k = 0.645), moderate reliability for Bing (k = 0.503), and fair reliability for Bard (k = 0.357). While GPT-4 showed the highest correctness and reliability in adult BLS situations, all tested chatbots struggled significantly in pediatric and infant cases. Furthermore, none of the chatbots consistently adhered to BLS guidelines, raising concerns about their potential use in real-life emergencies.
Based on these findings, AI chatbots in their current form can only be relied upon to guide bystanders through life-saving procedures with human supervision.



