A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAi o1 Pro, and Grok 3 performance on ophthalmology board-style questions.

Authors

Shean Ryan, Shah Tathya, Pandiarajan Aditya, Tang Alan, Bolo Kyle, Nguyen Van, Xu Benjamin

Affiliations

Keck School of Medicine, University of Southern California, 1975 Zonal Avenue, Los Angeles, CA, USA.

Information Sciences Institute, University of Southern California, 4676 Admiralty Way #1001, Marina Del Rey, CA, USA.

Publication information

Sci Rep. 2025 Jul 2;15(1):23101. doi: 10.1038/s41598-025-08601-2.

DOI: 10.1038/s41598-025-08601-2
PMID: 40595291

Abstract

The ability of large language models (LLMs) to accurately answer medical board-style questions reflects their potential to benefit medical education and real-time clinical decision-making. With the recent advance to reasoning models, the latest LLMs excel at addressing complex problems in benchmark math and science tests. This study assessed the performance of first-generation reasoning models (DeepSeek's R1 and R1-Lite, OpenAI's o1 Pro, and Grok 3) on 493 ophthalmology questions sourced from the StatPearls and EyeQuiz question banks. o1 Pro achieved the highest overall accuracy (83.4%), significantly outperforming DeepSeek R1 (72.5%), DeepSeek-R1-Lite (76.5%), and Grok 3 (69.2%) (p < 0.001 for all pairwise comparisons). o1 Pro also demonstrated superior performance on questions from eight of nine ophthalmologic subfields, on questions of second- and third-order cognitive complexity, and on image-based questions. DeepSeek-R1-Lite performed second best despite relatively small memory requirements, while Grok 3 had the lowest overall accuracy. These findings demonstrate that the strong performance of the first-generation reasoning models extends beyond benchmark tests to high-complexity ophthalmology questions. While these findings suggest a potential role for reasoning models in medical education and clinical practice, further research is needed to understand their performance with real-world data, their integration into educational and clinical settings, and human-AI interactions.
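
To give a rough sense of the sampling uncertainty around the reported accuracies, the sketch below computes a 95% Wilson score interval for each model. It is an illustration only, not the authors' analysis: the abstract does not state which statistical test was used, and the code assumes that every model was scored on all 493 questions and treats each accuracy as an independent binomial proportion. The function name `wilson_interval` and the variable names are hypothetical.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Figures taken from the abstract; assuming all 493 questions apply to each model.
N_QUESTIONS = 493
REPORTED_ACCURACY = {
    "o1 Pro": 0.834,
    "DeepSeek-R1-Lite": 0.765,
    "DeepSeek R1": 0.725,
    "Grok 3": 0.692,
}

intervals = {
    model: wilson_interval(acc, N_QUESTIONS)
    for model, acc in REPORTED_ACCURACY.items()
}

for model, (low, high) in intervals.items():
    acc = REPORTED_ACCURACY[model]
    print(f"{model}: {acc:.1%} (95% CI {low:.1%} to {high:.1%})")
```

Because all four models answered the same questions, a paired comparison is more sensitive than comparing these unpaired intervals, so overlapping intervals here would not contradict the reported pairwise p-values.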

Similar articles

1. A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAi o1 Pro, and Grok 3 performance on ophthalmology board-style questions.
   Sci Rep. 2025 Jul 2;15(1):23101. doi: 10.1038/s41598-025-08601-2.
2. Medical reasoning in LLMs: an in-depth analysis of DeepSeek R1.
   Front Artif Intell. 2025 Jun 18;8:1616145. doi: 10.3389/frai.2025.1616145. eCollection 2025.
3. Large language models provide discordant information compared to ophthalmology guidelines.
   Sci Rep. 2025 Jul 1;15(1):20556. doi: 10.1038/s41598-025-06404-z.
4. Evaluating ChatGPT and DeepSeek in postdural puncture headache management: a comparative study with international consensus guidelines.
   BMC Neurol. 2025 Jul 1;25(1):264. doi: 10.1186/s12883-025-04280-8.
5. A structured evaluation of LLM-generated step-by-step instructions in cadaveric brachial plexus dissection.
   BMC Med Educ. 2025 Jul 1;25(1):903. doi: 10.1186/s12909-025-07493-0.
6. Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China's Rare Disease Catalog: Comparative Study.
   J Med Internet Res. 2025 Jun 18;27:e69929. doi: 10.2196/69929.
7. Performance analysis of large language models Chatgpt-4o, OpenAI O1, and OpenAI O3 mini in clinical treatment of pneumonia: a comparative study.
   Clin Exp Med. 2025 Jun 20;25(1):213. doi: 10.1007/s10238-025-01743-7.
8. Performance of Large Language Models in the Non-English Context: Qualitative Study of Models Trained on Different Languages in Chinese Medical Examinations.
   JMIR Med Inform. 2025 Jun 27;13:e69485. doi: 10.2196/69485.
9. Clinical feasibility of AI Doctors: Evaluating the replacement potential of large language models in outpatient settings for central nervous system tumors.
   Int J Med Inform. 2025 Jun 12;203:106013. doi: 10.1016/j.ijmedinf.2025.106013.
10. Risk of Bias Assessment of Diagnostic Accuracy Studies Using QUADAS 2 by Large Language Models.
    Diagnostics (Basel). 2025 Jun 6;15(12):1451. doi: 10.3390/diagnostics15121451.
