Mikhail David, Farah Andrew, Milad Jason, Mihalache Andrew, Milad Daniel, Antaki Fares, Balas Michael, Popovic Marko M, Muni Rajeev H, Keane Pearse A, Duval Renaud
Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.
Faculty of Medicine, McGill University, Montreal, Quebec, Canada.
JAMA Ophthalmol. 2025 Sep 4. doi: 10.1001/jamaophthalmol.2025.2918.
IMPORTANCE: Large language models (LLMs) are increasingly being explored for clinical decision-making, but few studies have evaluated their performance on complex ophthalmology cases from clinical practice settings. Understanding whether open-weight, reasoning-enhanced LLMs can outperform proprietary models has implications for clinical utility and accessibility.
OBJECTIVE: To evaluate the diagnostic accuracy, management decision-making, and cost of DeepSeek-R1 vs OpenAI o1 across diverse ophthalmic subspecialties.
DESIGN, SETTING, AND PARTICIPANTS: This was a cross-sectional evaluation conducted using standardized prompts and model configurations. Clinical cases were sourced from JAMA Ophthalmology's Clinical Challenge articles, containing complex cases from clinical practice settings. Each case included an open-ended diagnostic question and a multiple-choice next-step decision. All cases were included without exclusions, and no human participants were involved. Data were analyzed from March 13 to March 30, 2025.
EXPOSURES: DeepSeek-R1 and OpenAI o1 were evaluated using the Plan-and-Solve Plus (PS+) prompt engineering method.
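The abstract does not reproduce the exact prompt, so the sketch below shows only what a PS+-style query could look like: the trigger phrasing is adapted from the published Plan-and-Solve Plus method, and the client setup, model name, and helper function are assumptions rather than the authors' code.

```python
# Minimal sketch of a PS+-style query, assuming an OpenAI-compatible API;
# DeepSeek-R1 can be served through a compatible endpoint. The trigger text
# is adapted from the published Plan-and-Solve Plus prompt, not taken from
# this study.
from openai import OpenAI

client = OpenAI()

PS_PLUS_TRIGGER = (
    "Let's first understand the problem and extract the relevant clinical "
    "variables. Then, let's devise a complete plan, carry out the plan, "
    "solve the problem step by step, and show the answer."
)

def ask_diagnosis(case_text: str, model: str = "o1") -> str:
    """Send one Clinical Challenge case with the PS+ trigger appended (hypothetical helper)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"{case_text}\n\nWhat is the most likely diagnosis?\n\n{PS_PLUS_TRIGGER}",
        }],
    )
    return response.choices[0].message.content
```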
MAIN OUTCOMES AND MEASURES: Primary outcomes were diagnostic accuracy and next-step decision-making accuracy, defined as the proportion of correct responses. Token cost analyses were performed to estimate expenses. Intermodel agreement was evaluated using the Cohen κ statistic, and the McNemar test was used to compare paired performance.
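As an illustration of these two statistics, the snippet below computes Cohen κ and an exact McNemar test from paired per-case correctness labels; the six-case vectors are hypothetical stand-ins, not the study data.

```python
# Cohen kappa (intermodel agreement) and McNemar test (paired accuracy
# comparison) on per-case binary correctness labels (1 = correct).
# The example vectors below are hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

deepseek_correct = np.array([1, 1, 0, 1, 0, 1])  # hypothetical
o1_correct       = np.array([1, 0, 0, 1, 1, 1])  # hypothetical

# Agreement on the paired correct/incorrect labels.
kappa = cohen_kappa_score(deepseek_correct, o1_correct)

# 2x2 table of paired outcomes; McNemar uses the discordant off-diagonals.
table = np.zeros((2, 2), dtype=int)
for d, o in zip(deepseek_correct, o1_correct):
    table[d, o] += 1
result = mcnemar(table, exact=True)

print(f"kappa = {kappa:.3f}, McNemar P = {result.pvalue:.3f}")
```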
RESULTS: A total of 422 clinical cases were included, spanning 10 subspecialties. DeepSeek-R1 achieved a higher diagnostic accuracy of 70.4% (297 of 422 cases) compared with 63.0% (266 of 422 cases) for OpenAI o1, a 7.3% difference (95% CI, 1.0%-13.7%; P = .02). For next-step decisions, DeepSeek-R1 was correct in 82.7% of cases (349 of 422) vs 75.8% (320 of 422) for OpenAI o1, a 6.9% difference (95% CI, 1.4%-12.3%; P = .01). Intermodel agreement was moderate (κ = 0.422; 95% CI, 0.375-0.469; P < .001). Per-query costs for DeepSeek-R1 were more than 66-fold lower than for OpenAI o1 (savings of up to 98.5%) during off-peak pricing.
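As a quick consistency check on the cost figures, an N-fold cost ratio implies fractional savings of 1 - 1/N, so 66-fold corresponds to roughly 98.5%; the helper below also sketches per-query token cost under placeholder prices, not the study's actual figures.

```python
# Consistency check on the reported savings: an N-fold cost ratio implies
# savings of 1 - 1/N. Prices and token counts here are placeholders.
def query_cost(tokens_in: int, tokens_out: int,
               usd_per_m_in: float, usd_per_m_out: float) -> float:
    """USD cost of one query given per-million-token prices."""
    return tokens_in / 1e6 * usd_per_m_in + tokens_out / 1e6 * usd_per_m_out

print(f"savings at 66-fold: {1 - 1/66:.1%}")                # -> 98.5%
print(f"example query cost: ${query_cost(1500, 4000, 15.0, 60.0):.4f}")  # placeholder prices
```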
CONCLUSIONS AND RELEVANCE: DeepSeek-R1 outperformed OpenAI o1 in diagnosis and management across subspecialties while lowering operating costs, supporting the potential of open-weight, reinforcement learning-augmented LLMs as scalable and cost-saving tools for clinical decision support. Further investigations should evaluate safety guardrails and assess the performance of self-hosted adaptations of DeepSeek-R1 with domain-specific ophthalmic expertise to optimize clinical utility.