Vrdoljak Josip, Boban Zvonimir, Males Ivan, Skrabic Roko, Kumric Marko, Ottosen Anna, Clemencau Alexander, Bozic Josko, Völker Sebastian
University of Split, School of Medicine, Department of Pathophysiology, Split, Croatia.
University of Split, School of Medicine, Department of Medical Physics, Split, Croatia.
Comput Biol Med. 2025 Jun;192(Pt B):110351. doi: 10.1016/j.compbiomed.2025.110351. Epub 2025 May 12.
Large Language Models (LLMs) hold promise for clinical decision support, but their real-world performance varies. We compared three leading models (OpenAI's "o1" Large Reasoning Model (LRM), Anthropic's Claude-3.5-Sonnet, and Meta's Llama-3.2-70B) to human experts in an emergency internal medicine setting.
We conducted a prospective comparative study on 73 anonymized patient cases from the Emergency Internal Medicine ward of the University Hospital Split, Croatia (June-September 2024). Two independent internal medicine specialists, blinded to model identity, graded the LLM-generated reports in two steps: (1) they evaluated the relevance of recommended diagnostic tests based on the patient's signs, symptoms, and medical history; (2) after reviewing the actual diagnostic test results, they assessed each model's final diagnosis, therapy plan, and follow-up recommendations. The same evaluative framework was applied to human-authored reports. Ratings used Likert scales (1-4 or 1-3), and statistical comparisons used the Friedman test and the Wilcoxon signed-rank test.
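For readers who want to see how this style of analysis is typically run, the sketch below applies the Friedman and Wilcoxon signed-rank tests to paired per-case Likert ratings with scipy. The arrays are placeholder data generated for illustration, not the study's actual ratings, and the variable names are assumptions.

```python
# Minimal sketch of the paired statistical comparison described in the Methods,
# assuming one final rating per case (n = 73) for each report source.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
# Placeholder Likert ratings (1-4); NOT the study's data.
human = rng.integers(3, 5, size=73)
o1 = rng.integers(3, 5, size=73)
claude = rng.integers(2, 5, size=73)
llama = rng.integers(2, 5, size=73)

# Omnibus test across the four related samples (same cases rated under each condition).
stat, p = friedmanchisquare(human, o1, claude, llama)
print(f"Friedman: chi2 = {stat:.2f}, p = {p:.3f}")

# Pairwise follow-up: Wilcoxon signed-rank test on matched per-case differences.
for name, model in [("o1", o1), ("Claude-3.5-Sonnet", claude), ("Llama-3.2-70B", llama)]:
    w, p_pair = wilcoxon(human, model)
    print(f"human vs {name}: W = {w:.1f}, p = {p_pair:.3f}")
```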
The o1 model achieved a mean final rating (3.63) statistically indistinguishable from human physicians (3.67; p = 0.62). Claude-3.5-Sonnet (3.38) and Llama-3.2-70B (3.23) scored significantly lower (p < 0.01 vs. o1), largely due to errors in therapy planning and non-medication recommendations. Despite this gap, all three models demonstrated ≥90% accuracy in final diagnoses and patient admission decisions. The o1 model correctly classified all abnormal lab values (100%), while Claude-3.5-Sonnet and Llama-3.2-70B showed minor errors (99.5% and 99% accuracy, respectively).
When evaluated on real-world emergency cases, an advanced LLM with enhanced reasoning (o1) can match expert-level clinical performance, underscoring its potential utility as a decision-support tool.