Suppr超能文献

ChatGPT、Gemini和DeepSeek在急诊科使用真实对话进行非关键分诊支持方面的表现。

Performance of ChatGPT, Gemini and DeepSeek for non-critical triage support using real-world conversations in emergency department.

作者信息

Lee Sukyo, Jung Sumin, Park Jong-Hak, Cho Hanjin, Moon Sungwoo, Ahn Sejoong

机构信息

Department of Emergency Medicine, Korea University Ansan Hospital, Ansan-si, 15355, Republic of Korea.

Core Research & Development Center, Korea University Ansan Hospital, Ansan-si, 15355, Republic of Korea.

出版信息

BMC Emerg Med. 2025 Sep 1;25(1):176. doi: 10.1186/s12873-025-01337-2.

Abstract

BACKGROUND

Timely and accurate triage is crucial for the emergency department (ED) care. Recently, there has been growing interest in applying large language models (LLMs) to support triage decision-making. However, most existing studies have evaluated these models using simulated scenarios rather than real-world clinical cases. Therefore, we evaluated the performance of multiple commercial LLMs for non-critical triage support in ED using real-world clinical conversations.

METHODS

We retrospectively analyzed real-world triage conversations prospectively collected from three tertiary hospitals in South Korea. Multiple commercial LLMs-including OpenAI GPT-4o, GPT-4.1, O3, Google Gemini 2.0 flash, Gemini 2.5 flash, Gemini 2.5 pro, DeepSeek V3, and DeepSeek R1-were evaluated for the accuracy in triaging patient urgency based solely on unsummarized dialogue. The Korean Triage and Acuity Scale (KTAS) assigned by triage nurses was used as the gold standard for evaluating the LLM classifications. Model performance was assessed under both a zero-shot prompting condition and a few-shot prompting condition that included representative examples.

RESULTS

A total of 1,057 triage cases were included in the analysis. Among the models, Gemini 2.5 flash achieved the highest accuracy (73.8%), specificity (88.9%), and PPV (94.0%). Gemini 2.5 pro demonstrated the highest sensitivity (90.9%) and F1-score (82.4%), though with lower specificity (23.3%). GPT-4.1 also showed balanced high accuracy (70.6%) and sensitivity (81.3%) with practical response times (1.79s). Performance varied widely between models and even between different versions from the same vendor. With few-shot prompting, most models showed further improvements in accuracy and F1-score.

CONCLUSIONS

LLMs can accurately triage ED patient urgency using real-world clinical conversations. Several models demonstrated both high sensitivity and acceptable response times, supporting the feasibility of LLM in non-critical triage support tools in diverse clinical environments. These findings apply to non-critical patients (KTAS 3-5), and further research should address integration with objective clinical data and real-time workflow.

摘要

背景

及时准确的分诊对于急诊科护理至关重要。最近,人们越来越有兴趣应用大语言模型(LLMs)来支持分诊决策。然而,大多数现有研究使用模拟场景而非真实世界的临床病例来评估这些模型。因此,我们使用真实世界的临床对话评估了多个商业大语言模型在急诊科非危急分诊支持方面的性能。

方法

我们回顾性分析了从韩国三家三级医院前瞻性收集的真实世界分诊对话。多个商业大语言模型——包括OpenAI GPT-4o、GPT-4.1、O3、谷歌Gemini 2.0 flash、Gemini 2.5 flash、Gemini 2.5 pro、渊亭V3和渊亭R1——仅根据未总结的对话对患者紧急程度进行分诊的准确性进行了评估。分诊护士分配的韩国分诊及 acuity 量表(KTAS)用作评估大语言模型分类的金标准。在零样本提示条件和包括代表性示例的少样本提示条件下评估模型性能。

结果

分析共纳入1057例分诊病例。在这些模型中,Gemini 2.5 flash的准确率(73.8%)、特异性(88.9%)和阳性预测值(94.0%)最高。Gemini 2.5 pro的灵敏度(90.9%)和F1分数(82.4%)最高,不过特异性较低(23.3%)。GPT-4.1也表现出平衡的高准确率(70.6%)和灵敏度(81.3%)以及实际响应时间(1.79秒)。不同模型之间甚至同一供应商的不同版本之间性能差异很大。在少样本提示下,大多数模型的准确率和F1分数进一步提高。

结论

大语言模型可以使用真实世界的临床对话准确分诊急诊科患者的紧急程度。几个模型表现出高灵敏度和可接受的响应时间,支持大语言模型在不同临床环境中的非危急分诊支持工具中的可行性。这些发现适用于非危急患者(KTAS 3-5),进一步的研究应解决与客观临床数据和实时工作流程的整合问题。

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/72ad/12403343/dc18cce6160a/12873_2025_1337_Fig1_HTML.jpg

文献AI研究员

20分钟写一篇综述,助力文献阅读效率提升50倍。

立即体验

用中文搜PubMed

大模型驱动的PubMed中文搜索引擎

马上搜索

文档翻译

学术文献翻译模型,支持多种主流文档格式。

立即体验