Pasli Sinan, Yadigaroğlu Metin, Kirimli Esma Nilay, Beşer Muhammet Fatih, Unutmaz İhsan, Ayhan Asu Özden, Karakurt Büşra, Şahin Abdul Samet, Hiçyilmaz Halil İbrahim, Imamoğlu Melih
Karadeniz Technical University, School of Medicine, Department of Emergency Medicine, Trabzon, Turkey.
Samsun University, School of Medicine, Department of Emergency Medicine, Samsun, Turkey.
Am J Emerg Med. 2025 Apr 17;94:63-70. doi: 10.1016/j.ajem.2025.04.040.
Triage aims to prioritize patients according to their medical urgency by accurately evaluating their clinical condition, thereby managing waiting times efficiently and improving the overall effectiveness of emergency care. This study aims to assess ChatGPT's performance in patient triage across four emergency departments with varying dynamics and to provide a detailed analysis of its strengths and weaknesses.
In this multicenter, prospective study, we compared the triage decisions made by ChatGPT-4o and by the triage personnel with the gold standard decisions determined by an emergency medicine (EM) specialist. In the hospitals where we conducted the study, triage teams routinely direct patients to the appropriate ED areas based on the Emergency Severity Index (ESI) system and each hospital's local triage protocols. During the study period, the triage team collected patient data, including chief complaints, comorbidities, and vital signs, and used this information to make the initial triage decisions. An independent physician simultaneously entered the same data into ChatGPT using voice commands. At the same time, an EM specialist, present in the triage room throughout the study period, reviewed the same patient data and determined the gold standard triage decisions, strictly adhering to both the hospital's local protocols and the ESI system. Before initiating the study, we customized ChatGPT for each hospital by designing prompts that incorporated both the general principles of the ESI triage system and the specific triage rules of each hospital. We evaluated the model's overall, hospital-based, and area-based performance using Cohen's kappa, F1 scores, and ROC-based performance analyses.
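For illustration only (the abstract does not include the study's analysis code): chance-corrected agreement and multi-class F1 scores of the kind reported below can be computed with scikit-learn. The zone labels and toy decisions here are hypothetical placeholders, not study data.

```python
# Hedged sketch: Cohen's kappa and multi-class F1 for triage decisions
# against a gold standard. All labels below are hypothetical examples.
from sklearn.metrics import cohen_kappa_score, f1_score

# Gold-standard triage zones assigned by the EM specialist (hypothetical)
gold  = ["red", "yellow", "green", "yellow", "green", "red"]
# Decisions by the triage team and by the model for the same patients
team  = ["red", "yellow", "green", "green",  "green", "yellow"]
model = ["red", "yellow", "green", "yellow", "green", "yellow"]

for name, pred in [("triage team", team), ("GPT-4o", model)]:
    kappa = cohen_kappa_score(gold, pred)          # chance-corrected agreement
    f1 = f1_score(gold, pred, average="weighted")  # multi-class F1
    print(f"{name}: kappa={kappa:.3f}, F1={f1:.3f}")
```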
This study included 6657 patients. Agreement with the gold standard was substantial for the triage personnel and almost perfect for GPT-4o (Cohen's kappa = 0.782 and 0.833, respectively). The overall F1 score was 0.863 for the triage team, while GPT-4o achieved 0.897, demonstrating superior performance. ROC curve analysis showed the lowest performance in the yellow zone of one tertiary hospital (AUC = 0.75) and in the red zone of another tertiary hospital (AUC = 0.78). Overall, however, AUC values exceeded 0.90, indicating high accuracy.
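As context for the per-zone ROC results, a multi-class triage classifier is commonly evaluated one zone versus the rest, binarizing each zone against all others before computing an AUC; whether the authors used exactly this binarization is an assumption. A minimal sketch with hypothetical data:

```python
# Hedged sketch: one-vs-rest AUC per triage zone from hard decisions.
# With binary 0/1 predictions, roc_auc_score reduces to balanced accuracy;
# the study's exact ROC methodology is not specified in the abstract.
from sklearn.metrics import roc_auc_score

gold  = ["red", "yellow", "green", "yellow", "green", "red"]
model = ["red", "yellow", "green", "yellow", "green", "yellow"]

for zone in ("red", "yellow", "green"):
    y_true = [int(g == zone) for g in gold]   # this zone vs. all others
    y_pred = [int(p == zone) for p in model]
    print(f"{zone}: AUC = {roc_auc_score(y_true, y_pred):.2f}")
```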
ChatGPT generally outperformed triage personnel in patient triage across emergency departments with varying conditions, demonstrating high agreement with the gold standard decisions. However, in the tertiary hospitals its performance was lower for patients with more complex presentations, particularly those requiring triage to the yellow and red zones.