A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAi o1 Pro, and Grok 3 performance on ophthalmology board-style questions.

Authors

Shean Ryan, Shah Tathya, Pandiarajan Aditya, Tang Alan, Bolo Kyle, Nguyen Van, Xu Benjamin

Affiliations

Keck School of Medicine, University of Southern California, 1975 Zonal Avenue, Los Angeles, CA, USA.

Information Sciences Institute, University of Southern California, 4676 Admiralty Way #1001, Marina Del Rey, CA, USA.

Publication information

Sci Rep. 2025 Jul 2;15(1):23101. doi: 10.1038/s41598-025-08601-2.

DOI:10.1038/s41598-025-08601-2

PMID:40595291

Abstract

The ability of large language models (LLMs) to accurately answer medical board-style questions reflects their potential to benefit medical education and real-time clinical decision-making. With the recent advance to reasoning models, the latest LLMs excel at addressing complex problems in benchmark math and science tests. This study assessed the performance of first-generation reasoning models (DeepSeek's R1 and R1-Lite, OpenAI's o1 Pro, and Grok 3) on 493 ophthalmology questions sourced from the StatPearls and EyeQuiz question banks. o1 Pro achieved the highest overall accuracy (83.4%), significantly outperforming DeepSeek R1 (72.5%), DeepSeek-R1-Lite (76.5%), and Grok 3 (69.2%) (p < 0.001 for all pairwise comparisons). o1 Pro also demonstrated superior performance on questions from eight of nine ophthalmologic subfields, on questions of second- and third-order cognitive complexity, and on image-based questions. DeepSeek-R1-Lite performed second best, despite relatively small memory requirements, while Grok 3 had the lowest overall accuracy. These findings demonstrate that the strong performance of the first-generation reasoning models extends beyond benchmark tests to high-complexity ophthalmology questions. While these findings suggest a potential role for reasoning models in medical education and clinical practice, further research is needed to understand their performance with real-world data, their integration into educational and clinical settings, and human-AI interactions.

