Suppr 超能文献

DeepSeek-R1 and GPT-4 are comparable in a complex diagnostic challenge: a historical control study.

Authors

Chan Lining, Xu Xinjie, Lv Kaiyang

Affiliation

Department of Plastic Surgery, Xinhua Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, People's Republic of China.

Publication

Int J Surg. 2025 Jun 1;111(6):4056-4059. doi: 10.1097/JS9.0000000000002386. Epub 2025 Apr 3.

DOI: 10.1097/JS9.0000000000002386
PMID: 40505040
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12165463/
Abstract

BACKGROUND

Large language models (LLMs) have demonstrated potential in medical diagnostics, but their accuracy in complex cases remains a subject of investigation. DeepSeek-R1, an open-source model with advanced reasoning capabilities, has gained global attention. This study evaluates the diagnostic performance of DeepSeek-R1 compared to GPT-4 in complex clinical cases.

MATERIALS AND METHODS

A historical control study was conducted using 100 clinicopathologic cases from the New England Journal of Medicine (NEJM), published between 18 August 2022, and 30 January 2025. Each case was processed using DeepSeek-R1 with a structured diagnostic prompt. The model's performance was assessed based on final diagnosis accuracy, differential diagnosis inclusion rate, ranking of correct diagnoses, and differential quality scores. Results were statistically compared to previously published GPT-4 performance data using chi-square, Mann-Whitney U, and t-tests.
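The evaluation metrics named above (final-diagnosis accuracy, differential inclusion rate, and ranking of the correct diagnosis) can be sketched in a few lines. The cases below are invented placeholders, not the NEJM cases, and the matching here is exact string comparison, whereas the study scored diagnoses by expert judgment:

```python
# Hypothetical evaluation records: each pairs a model's ranked differential
# list with the case's reference (final) diagnosis. Illustrative data only.
cases = [
    (["pulmonary embolism", "pneumonia", "heart failure"], "pulmonary embolism"),
    (["lymphoma", "tuberculosis", "sarcoidosis"], "sarcoidosis"),
    (["gout", "septic arthritis"], "reactive arthritis"),
]

n = len(cases)
# Final-diagnosis accuracy: the top-ranked diagnosis matches the reference.
top1 = sum(1 for dx_list, truth in cases if dx_list and dx_list[0] == truth)
# Inclusion rate: the reference diagnosis appears anywhere in the list.
included = sum(1 for dx_list, truth in cases if truth in dx_list)
# Mean 1-based rank of the correct diagnosis, over cases where it appears.
ranks = [dx_list.index(truth) + 1 for dx_list, truth in cases if truth in dx_list]
mean_rank = sum(ranks) / len(ranks) if ranks else float("nan")

print(f"final-diagnosis accuracy: {top1 / n:.0%}")
print(f"differential inclusion:   {included / n:.0%}")
print(f"mean rank of correct dx:  {mean_rank:.1f}")
```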

RESULTS

DeepSeek-R1 correctly matched the final diagnosis in 35% of cases (35/100), which was comparable to GPT-4's accuracy (39%; P = 0.634). However, DeepSeek-R1 included the correct diagnosis in its differential list in 48% of cases, significantly lower than GPT-4 (64%; P = 0.036). DeepSeek-R1 generated longer differential diagnoses (11.9 ± 2.0 vs. 9.0 ± 1.4; P = 0.000004) but maintained a similar mean rank for correct diagnoses (1.8 ± 2.2 vs. 2.5 ± 2.5; P = 0.288566) and equivalent differential quality scores (4.2 ± 0.10 vs. 4.2 ± 1.3; P = 0.099667).
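As a rough check on the headline proportions, a 2×2 chi-square test with Yates continuity correction can be computed from the counts above. This is one assumed reading of the "chi-square" test named in the methods; the paper does not state which correction was applied, so the P values here need not exactly match the reported 0.634 and 0.036:

```python
import math

def chi2_yates(a, b, c, d):
    """Pearson chi-square with Yates continuity correction for the 2x2 table
    [[a, b], [c, d]]; returns (statistic, p-value) for 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        exp = row * col / n
        stat += (abs(obs - exp) - 0.5) ** 2 / exp
    # For 1 df, the chi-square survival function reduces to erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Final-diagnosis accuracy: DeepSeek-R1 35/100 vs. GPT-4 39/100
stat1, p1 = chi2_yates(35, 65, 39, 61)
# Correct diagnosis in the differential list: 48/100 vs. 64/100
stat2, p2 = chi2_yates(48, 52, 64, 36)
print(f"accuracy:  chi2={stat1:.3f}, p={p1:.3f}")
print(f"inclusion: chi2={stat2:.3f}, p={p2:.3f}")
```

The first comparison is far from significance and the second falls below 0.05, matching the direction of the reported results.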

CONCLUSION

DeepSeek-R1 exhibits diagnostic accuracy comparable to GPT-4 while generating more diverse differential diagnoses. Its open-source nature and innovative reasoning strategies may enhance medical AI applications. Future studies should explore real-world clinical integration and refinement of differential diagnosis prioritization.


Figures (PMC):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d40f/12165463/a7ead02b96ef/js9-111-4056-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d40f/12165463/702f1ea8c89b/js9-111-4056-g002.jpg

Similar Articles

1. DeepSeek-R1 and GPT-4 are comparable in a complex diagnostic challenge: a historical control study.
   Int J Surg. 2025 Jun 1;111(6):4056-4059. doi: 10.1097/JS9.0000000000002386. Epub 2025 Apr 3.
2. Performance of GPT-4o and DeepSeek-R1 in the Polish Infectious Diseases Specialty Exam.
   Cureus. 2025 Apr 23;17(4):e82870. doi: 10.7759/cureus.82870. eCollection 2025 Apr.
3. Diagnostic Performance of Publicly Available Large Language Models in Corneal Diseases: A Comparison with Human Specialists.
   Diagnostics (Basel). 2025 May 13;15(10):1221. doi: 10.3390/diagnostics15101221.
4. Performance of DeepSeek, Qwen 2.5 MAX, and ChatGPT Assisting in Diagnosis of Corneal Eye Diseases, Glaucoma, and Neuro-Ophthalmology Diseases Based on Clinical Case Reports.
   medRxiv. 2025 Mar 17:2025.03.14.25323836. doi: 10.1101/2025.03.14.25323836.
5. DeepSeek in Healthcare: Revealing Opportunities and Steering Challenges of a New Open-Source Artificial Intelligence Frontier.
   Cureus. 2025 Feb 18;17(2):e79221. doi: 10.7759/cureus.79221. eCollection 2025 Feb.
6. Evaluating advanced AI reasoning models: ChatGPT-4.0 and DeepSeek-R1 diagnostic performance in otolaryngology: a comparative analysis.
   Am J Otolaryngol. 2025 May 10;46(4):104667. doi: 10.1016/j.amjoto.2025.104667.
7. Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration.
   JMIR Med Inform. 2024 Apr 9;12:e55627. doi: 10.2196/55627.
8. Performance of DeepSeek-R1 and ChatGPT-4o on the Chinese National Medical Licensing Examination: A Comparative Study.
   J Med Syst. 2025 Jun 3;49(1):74. doi: 10.1007/s10916-025-02213-z.
9. Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning.
   Nat Med. 2025 Apr 23. doi: 10.1038/s41591-025-03726-3.
10. Evaluating ChatGPT-4's Accuracy in Identifying Final Diagnoses Within Differential Diagnoses Compared With Those of Physicians: Experimental Study for Diagnostic Cases.
    JMIR Form Res. 2024 Jun 26;8:e59267. doi: 10.2196/59267.

References Cited in This Article

1. From GPT to DeepSeek: Significant gaps remain in realizing AI in healthcare.
   J Biomed Inform. 2025 Mar;163:104791. doi: 10.1016/j.jbi.2025.104791. Epub 2025 Feb 10.
2. DeepSeek versus ChatGPT: Multimodal artificial intelligence revolutionizing scientific discovery. From language editing to autonomous content generation-Redefining innovation in research and practice.
   Knee Surg Sports Traumatol Arthrosc. 2025 May;33(5):1553-1556. doi: 10.1002/ksa.12628. Epub 2025 Feb 12.
3. How China created AI model DeepSeek and shocked the world.
   Nature. 2025 Feb;638(8050):300-301. doi: 10.1038/d41586-025-00259-0.
4. China's cheap, open AI model DeepSeek thrills scientists.
   Nature. 2025 Feb;638(8049):13-14. doi: 10.1038/d41586-025-00229-6.
5. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge.
   JAMA. 2023 Jul 3;330(1):78-80. doi: 10.1001/jama.2023.8288.