Evaluating large language models on hospital health data for automated emergency triage.

Author information

Lafuente Carlos, Rahim Mehdi

Affiliations

DSIC, Universitat Politècnica de València, Valencia, Spain.

R&D, Air Liquide, Les Loges-en-Josas, France.

Publication information

Int J Comput Assist Radiol Surg. 2025 Jul 16. doi: 10.1007/s11548-025-03475-1.

DOI:10.1007/s11548-025-03475-1

PMID:40668511

Abstract

PURPOSE

Large language models (LLMs) have significant potential in healthcare because they can process unstructured text from electronic health records (EHRs) and generate knowledge with little or no training. In this study, we investigate the effectiveness of LLMs for clinical decision support, specifically in emergency department triage, where the volume of textual data is minimal compared to other scenarios such as making a clinical diagnosis.

METHODS

We benchmark LLMs against traditional machine learning (ML) approaches, using the Emergency Severity Index (ESI) as the gold-standard triage criterion. The benchmark includes general-purpose, specialised, and fine-tuned LLMs. All models are prompted to predict the ESI score from an EHR. We use a balanced subset (n = 1000) of MIMIC-IV-ED, a large database containing records of admissions to the emergency department of Beth Israel Deaconess Medical Center.
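As an illustration of the sampling and prompting steps described above, the following is a minimal sketch and not the authors' pipeline. It assumes that "balanced" means an equal number of records for each of the five ESI levels (200 each for n = 1000). The record keys `esi` and `text`, the function names, and the prompt wording are all hypothetical.

```python
import random
from collections import defaultdict


def balanced_subset(records, n_total=1000, levels=(1, 2, 3, 4, 5), seed=0):
    """Draw the same number of records for every ESI level.

    `records` is a list of dicts with hypothetical keys "esi" (int) and
    "text" (str); these are not the real MIMIC-IV-ED column names.
    """
    per_level = n_total // len(levels)

    # Group records by their ESI level.
    by_level = defaultdict(list)
    for rec in records:
        by_level[rec["esi"]].append(rec)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    subset = []
    for level in levels:
        pool = by_level[level]
        if len(pool) < per_level:
            raise ValueError(f"not enough records for ESI level {level}")
        subset.extend(rng.sample(pool, per_level))  # sample without replacement

    rng.shuffle(subset)  # avoid ordering the subset by level
    return subset


def build_prompt(rec):
    """Build a zero-shot prompt asking for an ESI level (illustrative wording)."""
    return (
        "You are assisting with emergency department triage.\n"
        f"Patient record: {rec['text']}\n"
        "Answer with the ESI level as a single integer from 1 to 5."
    )
```

Fixing the random seed keeps the evaluation subset identical across models, so differences in scores reflect the models and not the sample.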

RESULTS

Our findings show that the best-performing models have an average F1-score below 0.60. Also, while zero-shot and fine-tuned LLMs can outperform standard ML models, their performance is surpassed by ML models augmented with features derived from LLMs or knowledge graphs.
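The abstract does not state how the F1-score is averaged. The sketch below assumes a macro average, that is, the unweighted mean of the per-level F1-scores over the five ESI levels; the function name `macro_f1` is illustrative.

```python
def macro_f1(y_true, y_pred, labels=(1, 2, 3, 4, 5)):
    """Unweighted mean of per-class F1 over the given labels.

    Assumes macro averaging; a class with no true or predicted
    instances contributes an F1 of 0.0.
    """
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic
        # mean of precision and recall.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

On a balanced subset, macro and support-weighted averages coincide, so the choice matters mainly when class sizes differ.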

CONCLUSION

LLMs show value for clinical decision support in scenarios with limited textual data, such as emergency department triage. The study advocates for integrating LLM knowledge representation to improve existing ML models rather than using LLMs in isolation, suggesting this as a more promising approach to enhance the accuracy of automated triage systems.
