Evaluating large language and large reasoning models as decision support tools in emergency internal medicine.

Author information

Vrdoljak Josip, Boban Zvonimir, Males Ivan, Skrabic Roko, Kumric Marko, Ottosen Anna, Clemencau Alexander, Bozic Josko, Völker Sebastian

Affiliations

University of Split, School of Medicine, Department of Pathophysiology, Split, Croatia.

University of Split, School of Medicine, Department of Medical Physics, Split, Croatia.

Publication information

Comput Biol Med. 2025 Jun;192(Pt B):110351. doi: 10.1016/j.compbiomed.2025.110351. Epub 2025 May 12.

Abstract

BACKGROUND

Large Language Models (LLMs) hold promise for clinical decision support, but their real-world performance varies. We compared three leading models (OpenAI's "o1" Large Reasoning Model (LRM), Anthropic's Claude-3.5-Sonnet, and Meta's Llama-3.2-70B) to human experts in an emergency internal medicine setting.

METHODS

We conducted a prospective comparative study on 73 anonymized patient cases from the Emergency Internal Medicine ward of the University Hospital Split, Croatia (June-September 2024). Two independent internal medicine specialists, blinded to model identity, graded the LLM-generated reports in two steps: (1) they evaluated the relevance of recommended diagnostic tests based on the patient's signs, symptoms, and medical history; (2) after reviewing the actual diagnostic test results, they assessed each model's final diagnosis, therapy plan, and follow-up recommendations. The same evaluative framework was applied to human-authored reports. Likert scales (1-4 or 1-3) were used, and statistical comparisons included the Friedman and Wilcoxon signed-rank tests.
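As a minimal sketch of this statistical workflow, the snippet below applies the Friedman test (omnibus comparison across raters on the same cases) followed by pairwise Wilcoxon signed-rank tests, using scipy. The ratings are randomly generated placeholders on the 1-4 Likert scale, not the study's data, and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
n_cases = 73  # number of anonymized cases in the study

# Hypothetical Likert ratings (1-4), one value per case for each rater.
ratings = {
    "o1": rng.integers(3, 5, n_cases),
    "claude": rng.integers(2, 5, n_cases),
    "llama": rng.integers(2, 5, n_cases),
    "human": rng.integers(3, 5, n_cases),
}

# Omnibus test: repeated measures across all four raters on the same cases.
stat, p_friedman = friedmanchisquare(*ratings.values())
print(f"Friedman chi2={stat:.2f}, p={p_friedman:.3f}")

# Pairwise follow-up: Wilcoxon signed-rank test on paired ratings.
for model in ("o1", "claude", "llama"):
    # zero_method="wilcox" discards tied pairs, common for ordinal data
    w, p = wilcoxon(ratings[model], ratings["human"], zero_method="wilcox")
    print(f"{model} vs human: W={w:.1f}, p={p:.3f}")
```

The signed-rank design fits because every model and the human physicians rate the same 73 cases, so the ratings are paired and ordinal rather than independent samples.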

RESULTS

The o1 model achieved a mean final rating (3.63) statistically indistinguishable from human physicians (3.67; p = 0.62). Claude-3.5-Sonnet (3.38) and Llama-3.2-70B (3.23) scored significantly lower (p < 0.01 vs. o1), largely due to errors in therapy planning and non-medication recommendations. Despite this gap, all three models demonstrated ≥90% accuracy in final diagnoses and patient admission decisions. The o1 model correctly classified all abnormal lab values (100%), while Claude-3.5-Sonnet and Llama-3.2-70B showed minor errors (99.5% and 99% accuracy, respectively).

CONCLUSIONS

When evaluated on real-world emergency cases, an advanced LLM with enhanced reasoning (o1) can match expert-level clinical performance, underscoring its potential utility as a decision-support tool.
