Performance of ChatGPT, Gemini and DeepSeek for non-critical triage support using real-world conversations in emergency department.

Authors

Lee Sukyo, Jung Sumin, Park Jong-Hak, Cho Hanjin, Moon Sungwoo, Ahn Sejoong

Affiliations

Department of Emergency Medicine, Korea University Ansan Hospital, Ansan-si, 15355, Republic of Korea.

Core Research & Development Center, Korea University Ansan Hospital, Ansan-si, 15355, Republic of Korea.

Publication information

BMC Emerg Med. 2025 Sep 1;25(1):176. doi: 10.1186/s12873-025-01337-2.

DOI: 10.1186/s12873-025-01337-2
PMID: 40890624
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12403343/
Abstract

BACKGROUND

Timely and accurate triage is crucial for emergency department (ED) care. Recently, there has been growing interest in applying large language models (LLMs) to support triage decision-making. However, most existing studies have evaluated these models using simulated scenarios rather than real-world clinical cases. Therefore, we evaluated the performance of multiple commercial LLMs for non-critical triage support in the ED using real-world clinical conversations.

METHODS

We retrospectively analyzed real-world triage conversations prospectively collected from three tertiary hospitals in South Korea. Multiple commercial LLMs (OpenAI GPT-4o, GPT-4.1, and O3; Google Gemini 2.0 flash, Gemini 2.5 flash, and Gemini 2.5 pro; and DeepSeek V3 and DeepSeek R1) were evaluated for their accuracy in triaging patient urgency based solely on unsummarized dialogue. The Korean Triage and Acuity Scale (KTAS) level assigned by triage nurses was used as the gold standard for evaluating the LLM classifications. Model performance was assessed under both a zero-shot prompting condition and a few-shot prompting condition that included representative examples.
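The two prompting conditions can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual prompt: the function name `build_triage_prompt`, the instruction wording, the requested output format (a single KTAS level from 3 to 5), and the sample dialogues are all hypothetical, since the abstract does not report them.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical instruction text; the study's real prompt is not given in the abstract.
INSTRUCTION = (
    "You are assisting emergency department triage. "
    "Read the conversation and answer with a single KTAS level (3, 4, or 5)."
)


def build_triage_prompt(
    dialogue: str,
    examples: Optional[Sequence[Tuple[str, int]]] = None,
) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples).

    `examples` is a sequence of (dialogue, nurse-assigned KTAS level) pairs.
    """
    parts = [INSTRUCTION]
    # Few-shot condition: prepend labelled example conversations.
    for i, (ex_dialogue, ex_level) in enumerate(examples or [], start=1):
        parts.append(
            f"Example {i}\nConversation:\n{ex_dialogue}\nKTAS level: {ex_level}"
        )
    # The case to classify always comes last, with the answer left open.
    parts.append(f"Conversation:\n{dialogue}\nKTAS level:")
    return "\n\n".join(parts)


# Zero-shot: only the instruction and the unsummarized dialogue.
zero_shot = build_triage_prompt("Patient: I twisted my ankle an hour ago.")

# Few-shot: the same dialogue preceded by one made-up labelled example.
few_shot = build_triage_prompt(
    "Patient: I twisted my ankle an hour ago.",
    examples=[("Patient: I have had a mild cough for two days.", 5)],
)
```

The only difference between the two conditions in this sketch is the block of labelled examples placed between the instruction and the case to classify.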

RESULTS

A total of 1,057 triage cases were included in the analysis. Among the models, Gemini 2.5 flash achieved the highest accuracy (73.8%), specificity (88.9%), and positive predictive value (PPV; 94.0%). Gemini 2.5 pro demonstrated the highest sensitivity (90.9%) and F1-score (82.4%), though with lower specificity (23.3%). GPT-4.1 also showed balanced high accuracy (70.6%) and sensitivity (81.3%) with a practical response time (1.79 s). Performance varied widely between models and even between different versions from the same vendor. With few-shot prompting, most models showed further improvements in accuracy and F1-score.
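The reported metrics all follow from a binary confusion matrix against the nurse-assigned reference label. The sketch below uses the standard definitions; the function name `triage_metrics` and the counts are illustrative assumptions, not the study's data, and the abstract does not state which KTAS grouping was treated as the positive class.

```python
def triage_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall: positives correctly identified
    specificity = tn / (tn + fp)   # negatives correctly identified
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of PPV and sensitivity; it ignores true negatives.
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f1": f1,
    }


# Made-up counts for illustration only.
metrics = triage_metrics(tp=80, fp=10, tn=40, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.800
# sensitivity: 0.800
# specificity: 0.800
# ppv: 0.889
# f1: 0.842
```

Because F1 does not use true negatives, a model can pair a high F1-score with low specificity, which is the pattern reported for Gemini 2.5 pro.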

CONCLUSIONS

LLMs can accurately triage ED patient urgency using real-world clinical conversations. Several models demonstrated both high sensitivity and acceptable response times, supporting the feasibility of LLMs in non-critical triage support tools in diverse clinical environments. These findings apply to non-critical patients (KTAS 3-5), and further research should address integration with objective clinical data and real-time workflow.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/72ad/12403343/dc18cce6160a/12873_2025_1337_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/72ad/12403343/78700d3d0068/12873_2025_1337_Fig2_HTML.jpg

Similar articles

1
Performance of ChatGPT, Gemini and DeepSeek for non-critical triage support using real-world conversations in emergency department.
BMC Emerg Med. 2025 Sep 1;25(1):176. doi: 10.1186/s12873-025-01337-2.
2
Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.
J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.
3
Evaluating the Reasoning Capabilities of Large Language Models for Medical Coding and Hospital Readmission Risk Stratification: Zero-Shot Prompting Approach.
J Med Internet Res. 2025 Jul 30;27:e74142. doi: 10.2196/74142.
4
A multi-dimensional performance evaluation of large language models in dental implantology: comparison of ChatGPT, DeepSeek, Grok, Gemini and Qwen across diverse clinical scenarios.
BMC Oral Health. 2025 Jul 28;25(1):1272. doi: 10.1186/s12903-025-06619-6.
5
ChatGPT-supported patient triage with voice commands in the emergency department: A prospective multicenter study.
Am J Emerg Med. 2025 Apr 17;94:63-70. doi: 10.1016/j.ajem.2025.04.040.
6
Using a Diverse Test Suite to Assess Large Language Models on Fast Health Care Interoperability Resources Knowledge: Comparative Analysis.
J Med Internet Res. 2025 Aug 12;27:e73540. doi: 10.2196/73540.
7
Prescription of Controlled Substances: Benefits and Risks
8
Comparative evaluation of AI platforms "Google Gemini 2.5 Flash, Google Gemini 2.0 Flash, DeepSeek V3 and ChatGPT 4o" in solving multiple-choice questions from different subtopics of anatomy.
Surg Radiol Anat. 2025 Aug 30;47(1):193. doi: 10.1007/s00276-025-03707-8.
9
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.
JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
10
Potential of ChatGPT in youth mental health emergency triage: Comparative analysis with clinicians.
PCN Rep. 2025 Jul 15;4(3):e70159. doi: 10.1002/pcn5.70159. eCollection 2025 Sep.
