Ryan Shean, Tathya Shah, Sina Sobhani, Alan Tang, Ali Setayesh, Kyle Bolo, Van Nguyen, Benjamin Xu
Keck School of Medicine, University of Southern California, Los Angeles, California.
Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, California.
Ophthalmol Sci. 2025 Jun 6;5(6):100844. doi: 10.1016/j.xops.2025.100844. eCollection 2025 Nov-Dec.
To evaluate and compare the performance of human test takers and three artificial intelligence (AI) models (OpenAI o1, ChatGPT-4o, and Gemini 1.5 Flash) on ophthalmology board-style questions, focusing on overall accuracy and performance stratified by ophthalmic subspecialty and cognitive complexity level.
A cross-sectional study.
Five hundred ophthalmology board-style questions sourced from two question banks.
Three large language models answered the questions using standardized prompting procedures. A subanalysis stratified the questions by subspecialty and by cognitive complexity as defined by the Buckwalter taxonomic schema. Statistical analyses, including analysis of variance and the McNemar test, were conducted to assess performance differences.
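The abstract does not include the study's analysis code; as a minimal sketch of the paired comparison it describes, the snippet below runs a McNemar test on hypothetical per-question correctness vectors for two models, using statsmodels. The variable names and randomly generated data are illustrative assumptions, not the study's data.

```python
# Illustrative sketch only: hypothetical per-question correctness data,
# not the study's actual model responses.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 500

# 1 = correct, 0 = incorrect, for two models answering the same 500 questions.
model_a = rng.integers(0, 2, n_questions)  # e.g., OpenAI o1 (hypothetical)
model_b = rng.integers(0, 2, n_questions)  # e.g., GPT-4o (hypothetical)

# Build the 2x2 paired contingency table:
# rows = model A correct/incorrect, columns = model B correct/incorrect.
table = np.array([
    [np.sum((model_a == 1) & (model_b == 1)), np.sum((model_a == 1) & (model_b == 0))],
    [np.sum((model_a == 0) & (model_b == 1)), np.sum((model_a == 0) & (model_b == 0))],
])

# McNemar test operates on the discordant pairs (exact binomial version).
result = mcnemar(table, exact=True)
print(f"McNemar statistic = {result.statistic:.0f}, p-value = {result.pvalue:.4f}")
```

The McNemar test is appropriate here because both models answer the same question set, so their correct/incorrect outcomes are paired rather than independent.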
Accuracy of responses for each model and human test takers, stratified by subspecialty and cognitive complexity.
OpenAI o1 achieved the highest overall accuracy (423/500, 84.6%), significantly outperforming GPT-4o (331/500, 66.2%; P < 0.001) and Gemini (301/500, 60.2%; P < 0.001). OpenAI o1 also demonstrated superior performance on each of the two question banks (228/250, 91.2% and 195/250, 78.0%) compared with GPT-4o (183/250, 73.2% and 148/250, 59.2%) and Gemini (163/250, 65.2% and 137/250, 54.8%). On one question bank, human performance was lower (64.5%) than Gemini 1.5 Flash (65.2%), GPT-4o (73.2%), and OpenAI o1 (91.2%) (P < 0.001). OpenAI o1 outperformed the other models in each of the nine ophthalmic subfields and all three cognitive complexity levels.
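As a quick arithmetic check of the reported proportions, the snippet below recomputes the overall accuracies from the correct-answer counts given above; the counts come from the abstract, while the dictionary structure is purely illustrative.

```python
# Recompute the overall accuracies reported in the abstract from raw counts.
correct_counts = {
    "OpenAI o1": 423,
    "GPT-4o": 331,
    "Gemini 1.5 Flash": 301,
}
total_questions = 500

for model, correct in correct_counts.items():
    accuracy = correct / total_questions * 100
    print(f"{model}: {correct}/{total_questions} = {accuracy:.1f}%")
# OpenAI o1: 423/500 = 84.6%
# GPT-4o: 331/500 = 66.2%
# Gemini 1.5 Flash: 301/500 = 60.2%
```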
OpenAI o1 outperformed GPT-4o, Gemini, and human test takers in answering ophthalmology board-style questions from two question banks and across three complexity levels. These findings highlight advances in AI technology and OpenAI o1's growing potential as an adjunct in ophthalmic education and care.
The author(s) have no proprietary or commercial interest in any materials discussed in this article.