Diagnostic performance of newly developed large language models in critical illness cases: A comparative study.

Author Information

Wu Xintong, Huang Yu, He Qing

Affiliations

Department of Intensive Care Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China.

Publication Information

Int J Med Inform. 2025 Dec;204:106088. doi: 10.1016/j.ijmedinf.2025.106088. Epub 2025 Aug 23.

Abstract

BACKGROUND

Large language models (LLMs) are increasingly used in clinical decision support, and newly developed models have demonstrated promising potential, yet their diagnostic performance for critically ill patients in intensive care unit (ICU) settings remains underexplored. This study evaluated the diagnostic accuracy, differential diagnosis quality, and response quality of four newly developed LLMs in critical illness cases.

METHODS

In this cross-sectional comparative study, four newly developed LLMs (ChatGPT-4o, ChatGPT-o3, DeepSeek-V3, and DeepSeek-R1) were evaluated on 50 critical illness cases in ICU settings drawn from published literature. Diagnostic accuracy and response quality were compared across models.
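As an illustration only, the sketch below shows the kind of per-case scoring record such a comparison implies, covering top-diagnosis correctness, the differential diagnosis quality score, and the Likert response ratings reported in the results; the field names, matching rule, and example values are assumptions for illustration, not the authors' actual protocol or data.

```python
# Illustrative sketch only: field names, the matching rule, and the example
# values below are assumptions, not the authors' protocol or data.
from dataclasses import dataclass

@dataclass
class CaseEvaluation:
    case_id: int
    model: str                   # e.g. "ChatGPT-o3"
    top_diagnosis_correct: bool  # top diagnosis matches the case's reference diagnosis
    differential_quality: int    # 1-5 differential diagnosis quality score
    completeness: int            # 1-5 Likert rating
    clarity: int                 # 1-5 Likert rating
    usefulness: int              # 1-5 Likert rating

def diagnostic_accuracy(records: list[CaseEvaluation], model: str) -> float:
    """Proportion of cases in which the named model's top diagnosis was correct."""
    scored = [r for r in records if r.model == model]
    return sum(r.top_diagnosis_correct for r in scored) / len(scored)

# Hypothetical records for a single case (values are illustrative, not study data).
records = [
    CaseEvaluation(1, "ChatGPT-o3", True, 5, 5, 5, 5),
    CaseEvaluation(1, "DeepSeek-V3", False, 3, 4, 4, 4),
]
print(diagnostic_accuracy(records, "ChatGPT-o3"))  # -> 1.0
```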

RESULTS

A total of 50 critical illness cases were included. ChatGPT-o3 achieved the highest diagnostic accuracy at 72% (36/50; 95% CI 0.600-0.840), followed by DeepSeek-R1 at 68% (34/50; 95% CI 0.540-0.800), ChatGPT-4o at 64% (32/50; 95% CI 0.500-0.760), and DeepSeek-V3 at 32% (16/50; 95% CI 0.200-0.460). ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o all significantly outperformed DeepSeek-V3, with no significant differences among the three. The median differential diagnosis quality score was 5.0 for ChatGPT-o3 (IQR 5.0-5.0; 95% CI 5.0-5.0), DeepSeek-R1 (IQR 5.0-5.0; 95% CI 5.0-5.0), and ChatGPT-4o (IQR 4.0-5.0; 95% CI 4.5-5.0), and 4.0 for DeepSeek-V3 (IQR 3.0-5.0; 95% CI 4.0-5.0). ChatGPT-o3 and DeepSeek-R1 scored significantly higher than DeepSeek-V3, while ChatGPT-4o showed a non-significant trend toward better performance. All models received high Likert ratings for response completeness, clarity, and usefulness. ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o each showed a trend toward better response quality than DeepSeek-V3, although no significant differences were observed among the models.
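The abstract does not state which confidence-interval method or significance test was used. As a hedged illustration, the sketch below computes Wilson score intervals for the reported top-diagnosis counts and an exact McNemar test for paired per-case comparisons between two models; both are common choices for this kind of data, but they are assumptions here, so the output may differ slightly from the published figures.

```python
# Illustrative sketch only: the study does not specify its CI method or
# significance test, so the Wilson interval and exact McNemar test used
# here are assumptions and may differ slightly from the published figures.
import math
from scipy.stats import binomtest

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score 95% confidence interval for a proportion correct/n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def mcnemar_exact(a_correct: list[bool], b_correct: list[bool]) -> float:
    """Exact McNemar p-value for two models scored on the same cases."""
    only_a = sum(x and not y for x, y in zip(a_correct, b_correct))  # A right, B wrong
    only_b = sum(y and not x for x, y in zip(a_correct, b_correct))  # B right, A wrong
    if only_a + only_b == 0:
        return 1.0  # no discordant cases, hence no evidence of a difference
    # The exact McNemar test is a two-sided binomial test on the discordant pairs.
    return binomtest(only_a, only_a + only_b, 0.5).pvalue

# Top-diagnosis counts out of 50 cases, as reported in the results above.
for model, correct in [("ChatGPT-o3", 36), ("DeepSeek-R1", 34),
                       ("ChatGPT-4o", 32), ("DeepSeek-V3", 16)]:
    lo, hi = wilson_ci(correct, 50)
    print(f"{model}: {correct / 50:.0%} (95% CI {lo:.3f}-{hi:.3f})")
```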

CONCLUSIONS

The newly developed models, especially the reasoning models, demonstrated strong potential in supporting diagnosis in critical illness cases in ICU settings. With further domain-specific fine-tuning, their diagnostic accuracy could be further enhanced. Notably, the open-source reasoning model DeepSeek-R1 performed competitively, suggesting strong potential for scalable deployment in resource-limited settings.
