Mikhail David, Farah Andrew, Milad Jason, Mihalache Andrew, Milad Daniel, Antaki Fares, Balas Michael, Popovic Marko M, Muni Rajeev H, Keane Pearse A, Duval Renaud
Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.
Faculty of Medicine, McGill University, Montreal, Quebec, Canada.
JAMA Ophthalmol. 2025 Sep 4. doi: 10.1001/jamaophthalmol.2025.2918.
IMPORTANCE: Large language models (LLMs) are increasingly being explored for clinical decision-making, but few studies have evaluated their performance on complex ophthalmology cases from clinical practice settings. Understanding whether open-weight, reasoning-enhanced LLMs can outperform proprietary models has implications for clinical utility and accessibility.
OBJECTIVE: To evaluate the diagnostic accuracy, management decision-making, and cost of DeepSeek-R1 vs OpenAI o1 across diverse ophthalmic subspecialties.
DESIGN, SETTING, AND PARTICIPANTS: This was a cross-sectional evaluation conducted using standardized prompts and model configurations. Clinical cases were sourced from JAMA Ophthalmology's Clinical Challenge articles, containing complex cases from clinical practice settings. Each case included an open-ended diagnostic question and a multiple-choice next-step decision. All cases were included without exclusions, and no human participants were involved. Data were analyzed from March 13 to March 30, 2025.
EXPOSURES: DeepSeek-R1 and OpenAI o1 were evaluated using the Plan-and-Solve Plus (PS+) prompt engineering method.
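The abstract does not reproduce the exact prompt, so the sketch below shows only what a PS+-style query could look like: the trigger phrasing is adapted from the published Plan-and-Solve Plus method, and the client setup, model name, and helper function are assumptions rather than the authors' code.

```python
# Minimal sketch of a PS+-style query, assuming an OpenAI-compatible API;
# DeepSeek-R1 can be served through a compatible endpoint. The trigger text
# is adapted from the published Plan-and-Solve Plus prompt, not taken from
# this study.
from openai import OpenAI

client = OpenAI()

PS_PLUS_TRIGGER = (
    "Let's first understand the problem and extract the relevant clinical "
    "variables. Then, let's devise a complete plan, carry out the plan, "
    "solve the problem step by step, and show the answer."
)

def ask_diagnosis(case_text: str, model: str = "o1") -> str:
    """Send one Clinical Challenge case with the PS+ trigger appended (hypothetical helper)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"{case_text}\n\nWhat is the most likely diagnosis?\n\n{PS_PLUS_TRIGGER}",
        }],
    )
    return response.choices[0].message.content
```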
MAIN OUTCOMES AND MEASURES: Primary outcomes were diagnostic accuracy and next-step decision-making accuracy, defined as the proportion of correct responses. Token cost analyses were performed to estimate expenses. Intermodel agreement was evaluated using the Cohen κ statistic, and the McNemar test was used to compare paired performance.
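As an illustration of these two statistics, the snippet below computes Cohen κ and an exact McNemar test from paired per-case correctness labels; the six-case vectors are hypothetical stand-ins, not the study data.

```python
# Cohen kappa (intermodel agreement) and McNemar test (paired accuracy
# comparison) on per-case binary correctness labels (1 = correct).
# The example vectors below are hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

deepseek_correct = np.array([1, 1, 0, 1, 0, 1])  # hypothetical
o1_correct       = np.array([1, 0, 0, 1, 1, 1])  # hypothetical

# Agreement on the paired correct/incorrect labels.
kappa = cohen_kappa_score(deepseek_correct, o1_correct)

# 2x2 table of paired outcomes; McNemar uses the discordant off-diagonals.
table = np.zeros((2, 2), dtype=int)
for d, o in zip(deepseek_correct, o1_correct):
    table[d, o] += 1
result = mcnemar(table, exact=True)

print(f"kappa = {kappa:.3f}, McNemar P = {result.pvalue:.3f}")
```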
RESULTS: A total of 422 clinical cases were included, spanning 10 subspecialties. DeepSeek-R1 achieved a higher diagnostic accuracy of 70.4% (297 of 422 cases) compared with 63.0% (266 of 422 cases) for OpenAI o1, a 7.3% difference (95% CI, 1.0%-13.7%; P = .02). For next-step decisions, DeepSeek-R1 was correct in 82.7% of cases (349 of 422) vs 75.8% (320 of 422) for OpenAI o1, a 6.9% difference (95% CI, 1.4%-12.3%; P = .01). Intermodel agreement was moderate (κ = 0.422; 95% CI, 0.375-0.469; P < .001). Per-query costs for DeepSeek-R1 were more than 66-fold lower than for OpenAI o1 (savings of up to 98.5%) during off-peak pricing.
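As a quick consistency check on the cost figures, an N-fold cost ratio implies fractional savings of 1 - 1/N, so 66-fold corresponds to roughly 98.5%; the helper below also sketches per-query token cost under placeholder prices, not the study's actual figures.

```python
# Consistency check on the reported savings: an N-fold cost ratio implies
# savings of 1 - 1/N. Prices and token counts here are placeholders.
def query_cost(tokens_in: int, tokens_out: int,
               usd_per_m_in: float, usd_per_m_out: float) -> float:
    """USD cost of one query given per-million-token prices."""
    return tokens_in / 1e6 * usd_per_m_in + tokens_out / 1e6 * usd_per_m_out

print(f"savings at 66-fold: {1 - 1/66:.1%}")                # -> 98.5%
print(f"example query cost: ${query_cost(1500, 4000, 15.0, 60.0):.4f}")  # placeholder prices
```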
CONCLUSIONS AND RELEVANCE: DeepSeek-R1 outperformed OpenAI o1 in diagnosis and management across subspecialties while lowering operating costs, supporting the potential of open-weight, reinforcement learning-augmented LLMs as scalable and cost-saving tools for clinical decision support. Further investigations should evaluate safety guardrails and assess the performance of self-hosted adaptations of DeepSeek-R1 with domain-specific ophthalmic expertise to optimize clinical utility.