Can AI match emergency physicians in managing common emergency cases? A comparative performance evaluation.

Author information

Gün Mehmet

Affiliation

Ministry of Health, Şile State Hospital, Kumbaba Street, Şile, Istanbul, Türkiye.

Publication information

BMC Emerg Med. 2025 Jul 31;25(1):142. doi: 10.1186/s12873-025-01303-y.

Abstract

BACKGROUND

Large language models (LLMs) such as ChatGPT are increasingly explored for clinical decision support. However, their performance in high-stakes emergency scenarios remains underexamined. This study aimed to evaluate ChatGPT's diagnostic and therapeutic accuracy against that of a board-certified emergency physician across diverse emergency cases.

METHODS

This comparative study used 15 standardized emergency scenarios sourced from validated academic platforms (Geeky Medics, Life in the Fast Lane, Emergency Medicine Cases). ChatGPT (GPT-4) and a board-certified emergency physician independently evaluated each case on five predefined parameters: diagnosis, investigations, initial treatment, clinical safety, and decision-making complexity. Each case was scored out of 5, and concordance was categorized as high (5/5), moderate (4/5), or low (≤ 3/5). A 95% Wilson confidence interval was calculated for each concordance category.
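For readers who want to reproduce the interval estimates, the following is a minimal sketch of an uncorrected Wilson score interval for a binomial proportion, applied to the three concordance counts reported in the Results (8, 4, and 3 of 15 cases). The function name `wilson_ci` is illustrative and not from the paper, and the published bounds may differ slightly if a continuity correction or different rounding was used.

```python
# Minimal sketch: uncorrected Wilson score interval for a binomial proportion.
# The counts (8, 4, 3 out of 15) mirror the concordance categories in the
# Results; the paper's exact bounds may differ slightly (e.g. if a continuity
# correction or different rounding was applied).
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (lower, upper) bounds of the 95% Wilson score interval."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    return center - margin, center + margin

if __name__ == "__main__":
    for label, count in [("high (5/5)", 8), ("moderate (4/5)", 4), ("low (<=3/5)", 3)]:
        low, high = wilson_ci(count, 15)
        print(f"{label}: {count}/15 -> {low:.1%} to {high:.1%}")
```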

RESULTS

ChatGPT achieved high concordance (5/5) in 8 cases (53.3%, 95% CI: 27.6-77.0%), moderate concordance (4/5) in 4 cases (26.7%, CI: 10.3-55.4%), and low concordance (≤ 3/5) in 3 cases (20.0%, CI: 6.0-45.6%). Performance was strongest in structured, protocol-based conditions such as STEMI, DKA, and asthma. Lower performance was observed in complex scenarios like stroke, trauma with shock, and mixed acid-base disturbances.

CONCLUSION

ChatGPT showed strong alignment with emergency physician decisions in structured scenarios but lacked reliability in complex cases. While AI may enhance decision-making and education, it cannot replace the clinical reasoning of human physicians. Its role is best framed as a supportive tool rather than a substitute.
