Chan Lining, Xu Xinjie, Lv Kaiyang
Department of Plastic Surgery, Xinhua Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, People's Republic of China.
Int J Surg. 2025 Jun 1;111(6):4056-4059. doi: 10.1097/JS9.0000000000002386. Epub 2025 Apr 3.
Large language models (LLMs) have demonstrated potential in medical diagnostics, but their accuracy in complex cases remains under investigation. DeepSeek-R1, an open-source model with advanced reasoning capabilities, has gained global attention. This study evaluates the diagnostic performance of DeepSeek-R1 compared with GPT-4 in complex clinical cases.
A historical control study was conducted using 100 clinicopathologic cases from the New England Journal of Medicine (NEJM) published between 18 August 2022 and 30 January 2025. Each case was processed by DeepSeek-R1 with a structured diagnostic prompt. The model's performance was assessed on final diagnosis accuracy, differential diagnosis inclusion rate, ranking of the correct diagnosis, and differential quality scores. Results were compared with previously published GPT-4 performance data using chi-square, Mann-Whitney U, and t-tests.
DeepSeek-R1 correctly matched the final diagnosis in 35% of cases (35/100), comparable to GPT-4's accuracy (39%; P = 0.634). However, DeepSeek-R1 included the correct diagnosis in its differential list in only 48% of cases, significantly lower than GPT-4 (64%; P = 0.036). DeepSeek-R1 generated longer differential diagnosis lists (11.9 ± 2.0 vs. 9.0 ± 1.4 diagnoses per case; P = 0.000004) but achieved a similar mean rank for the correct diagnosis (1.8 ± 2.2 vs. 2.5 ± 2.5; P = 0.288566) and equivalent differential quality scores (4.2 ± 0.10 vs. 4.2 ± 1.3; P = 0.099667).
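The two proportion comparisons above can be approximately reconstructed from the counts reported in the abstract. The sketch below applies a Yates-corrected chi-square test to each 2x2 table; this is an assumption, since the abstract specifies only "chi-square" and the authors' exact procedure (and hence their exact P values) may differ. Only the counts come from the study; everything else is illustrative.

```python
# Hedged sketch: two-proportion comparisons from the abstract's counts,
# using a Yates-corrected chi-square test (standard library only).
# The per-case raw data are not available; only the 2x2 counts are from the paper.
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Yates-corrected chi-square test on the 2x2 table [[a, b], [c, d]].
    Returns (chi2 statistic, two-sided p-value); df = 1."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        exp = row * col / n
        chi2 += (abs(obs - exp) - 0.5) ** 2 / exp
    # Survival function of the chi-square distribution with df = 1
    p = erfc(sqrt(chi2 / 2))
    return chi2, p

# Final-diagnosis accuracy: DeepSeek-R1 35/100 vs. GPT-4 39/100
_, p_final = chi_square_2x2(35, 65, 39, 61)
# Differential-list inclusion: DeepSeek-R1 48/100 vs. GPT-4 64/100
_, p_incl = chi_square_2x2(48, 52, 64, 36)

print(f"final diagnosis p = {p_final:.3f}")        # not significant (p > 0.05)
print(f"differential inclusion p = {p_incl:.3f}")  # significant (p < 0.05)
```

Run against the reported counts, this reproduces the qualitative pattern of the study: no significant difference in final-diagnosis accuracy, but a significant gap in differential-list inclusion.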
DeepSeek-R1 exhibits diagnostic accuracy comparable to GPT-4 while generating more diverse differential diagnoses. Its open-source nature and innovative reasoning strategies may enhance medical AI applications. Future studies should explore real-world clinical integration and refinement of differential diagnosis prioritization.