Diagnostic performance of newly developed large language models in critical illness cases: A comparative study.

Authors

Wu Xintong, Huang Yu, He Qing

Affiliations

Department of Intensive Care Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China.

Publication information

Int J Med Inform. 2025 Dec;204:106088. doi: 10.1016/j.ijmedinf.2025.106088. Epub 2025 Aug 23.

DOI: 10.1016/j.ijmedinf.2025.106088
PMID: 40865411

Abstract

BACKGROUND

Large language models (LLMs) are increasingly used in clinical decision support, and newly developed models have shown promise, yet their diagnostic performance for critically ill patients in intensive care unit (ICU) settings remains underexplored. This study evaluated the diagnostic accuracy, differential diagnosis quality, and response quality of four newly developed LLMs in critical illness cases.

METHODS

In this cross-sectional comparative study, four newly developed LLMs (ChatGPT-4o, ChatGPT-o3, DeepSeek-V3, and DeepSeek-R1) were evaluated using 50 critical illness cases in ICU settings from published literature. Diagnostic accuracy and response quality were compared across models.

RESULTS

A total of 50 critical illness cases were included. ChatGPT-o3 achieved the highest diagnostic accuracy at 72% (36/50; 95% CI 0.600-0.840), followed by DeepSeek-R1 at 68% (34/50; 95% CI 0.540-0.800), ChatGPT-4o at 64% (32/50; 95% CI 0.500-0.760), and DeepSeek-V3 at 32% (16/50; 95% CI 0.200-0.460). ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o all significantly outperformed DeepSeek-V3, with no significant differences among the three. The median differential quality score was 5.0 for ChatGPT-o3 (IQR 5.0-5.0; 95% CI 5.0-5.0), DeepSeek-R1 (IQR 5.0-5.0; 95% CI 5.0-5.0), and ChatGPT-4o (IQR 4.0-5.0; 95% CI 4.5-5.0), and 4.0 for DeepSeek-V3 (IQR 3.0-5.0; 95% CI 4.0-5.0). ChatGPT-o3 and DeepSeek-R1 scored significantly higher than DeepSeek-V3; ChatGPT-4o showed a non-significant trend toward better performance than DeepSeek-V3. All models received high Likert ratings for response completeness, clarity, and usefulness. ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o each showed a trend toward better response quality compared to DeepSeek-V3, although no significant differences were observed among the models.
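As a rough illustration of how an accuracy and its 95% confidence interval can be derived from counts such as 36/50, the following sketch computes a Wilson score interval for each model. The abstract does not state which interval method the authors used, so the Wilson method and the function name `wilson_interval` are assumptions made here for illustration; the resulting bounds are close to, but not identical with, the reported ones.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not name its CI method; Wilson is
    used here only as a common choice for small samples.
    """
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width

# Correct-diagnosis counts out of 50 cases, taken from the results above.
correct = {
    "ChatGPT-o3": 36,
    "DeepSeek-R1": 34,
    "ChatGPT-4o": 32,
    "DeepSeek-V3": 16,
}

for model, k in correct.items():
    lo, hi = wilson_interval(k, 50)
    print(f"{model}: accuracy {k / 50:.0%}, 95% CI {lo:.3f}-{hi:.3f}")
```

The interval for DeepSeek-V3 does not overlap the other three under this method either. Non-overlap is only an informal check, though: the same 50 cases were given to every model, so a formal comparison would need a paired test on per-case results, which the abstract does not provide.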

CONCLUSIONS

The newly developed models, especially the reasoning models, demonstrated strong potential in supporting diagnosis in critical illness cases in ICU settings. Domain-specific fine-tuning could further improve their diagnostic accuracy. Notably, the open-source reasoning model DeepSeek-R1 performed competitively, suggesting strong potential for scalable deployment in resource-limited settings.
