Lafuente Carlos, Rahim Mehdi
DSIC, Universitat Politècnica de València, Valencia, Spain.
R&D, Air Liquide, Les Loges-en-Josas, France.
Int J Comput Assist Radiol Surg. 2025 Jul 16. doi: 10.1007/s11548-025-03475-1.
Large language models (LLMs) have significant potential in healthcare due to their ability to process unstructured text from electronic health records (EHRs) and to generate knowledge with little or no training. In this study, we investigate the effectiveness of LLMs for clinical decision support, specifically in the context of emergency department triage, where the volume of textual data is minimal compared to other scenarios such as making a clinical diagnosis.
We benchmark LLMs against traditional machine learning (ML) approaches using the Emergency Severity Index (ESI) as the gold-standard triage criterion. The benchmark includes general-purpose, specialised, and fine-tuned LLMs. All models are prompted to predict the ESI score from an EHR. We use a balanced subset (n = 1000) from MIMIC-IV-ED, a large database containing records of admissions to the emergency department of Beth Israel Deaconess Medical Center.
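The abstract does not say how the balanced subset was drawn. A minimal sketch of one plausible approach is shown below, assuming "balanced" means an equal number of records per ESI level (200 each for n = 1000 and five levels). The record layout and the `esi` key are hypothetical and are not taken from the MIMIC-IV-ED schema.

```python
import random
from collections import defaultdict


def balanced_subset(records, label_key="esi", n_total=1000, seed=0):
    """Draw an equal number of records per label, without replacement.

    `records` is a list of dicts; `label_key` is a hypothetical field
    holding the ESI level. This is an illustrative assumption, not the
    sampling procedure used in the study.
    """
    # Group records by their label.
    by_label = defaultdict(list)
    for record in records:
        by_label[record[label_key]].append(record)

    per_class, remainder = divmod(n_total, len(by_label))
    if remainder:
        raise ValueError("n_total must be divisible by the number of labels")

    rng = random.Random(seed)  # fixed seed for a reproducible draw
    subset = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < per_class:
            raise ValueError(f"not enough records for label {label}")
        subset.extend(rng.sample(group, per_class))

    rng.shuffle(subset)  # avoid leaving the subset ordered by label
    return subset
```

With five ESI levels present in `records`, `balanced_subset(records)` returns 1000 records, 200 per level.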
Our findings show that the best-performing models have an average F1-score below 0.60. Also, while zero-shot and fine-tuned LLMs can outperform standard ML models, their performance is surpassed by ML models augmented with features derived from LLMs or knowledge graphs.
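For readers unfamiliar with the metric, the sketch below computes a macro-averaged F1-score over the five ESI levels. That "average F1-score" here means the unweighted mean of per-class F1 is an assumption; the abstract does not specify the averaging scheme.

```python
from collections import Counter

ESI_LEVELS = (1, 2, 3, 4, 5)


def macro_f1(y_true, y_pred, labels=ESI_LEVELS):
    """Unweighted mean of per-class F1 scores (macro average)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)
```

Because the evaluation subset is balanced, the macro average and a support-weighted average would coincide, so the choice between them should not change the reported figure much under this assumption.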
LLMs show value for clinical decision support in scenarios with limited textual data, such as emergency department triage. The study advocates integrating LLM knowledge representations into existing ML models rather than using LLMs in isolation, and suggests that this is a more promising way to improve the accuracy of automated triage systems.