OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board-Style Questions.

Authors

Shean Ryan, Shah Tathya, Sobhani Sina, Tang Alan, Setayesh Ali, Bolo Kyle, Nguyen Van, Xu Benjamin

Affiliations

Keck School of Medicine, University of Southern California, Los Angeles, California.

Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, California.

Publication

Ophthalmol Sci. 2025 Jun 6;5(6):100844. doi: 10.1016/j.xops.2025.100844. eCollection 2025 Nov-Dec.

Abstract

PURPOSE

To evaluate and compare the performance of human test takers and of three artificial intelligence (AI) models (OpenAI o1, ChatGPT-4o, and Gemini 1.5 Flash) on ophthalmology board-style questions, focusing on overall accuracy and on performance stratified by ophthalmic subspecialty and cognitive complexity level.

DESIGN

A cross-sectional study.

SUBJECTS

Five hundred questions sourced from two question banks.

METHODS

Three large language models interpreted the questions using standardized prompting procedures. Subanalyses stratified the questions by subspecialty and by cognitive complexity as defined by the Buckwalter taxonomic schema. Statistical analyses, including analysis of variance and the McNemar test, were conducted to assess performance differences.
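
A minimal sketch of the paired comparison described above, running McNemar's test (via statsmodels) on two models' per-question correctness. The variable names and the randomly generated 0/1 vectors are illustrative placeholders, not the study's data; in practice each vector would come from grading a model's 500 answers against the key.

    # Sketch of a paired model-vs-model McNemar comparison; data are placeholders.
    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    rng = np.random.default_rng(0)
    n_questions = 500

    # Illustrative 0/1 correctness vectors aligned by question (hypothetical).
    o1_correct = rng.random(n_questions) < 0.846
    gpt4o_correct = rng.random(n_questions) < 0.662

    # 2x2 paired-outcome table: rows = o1 correct/incorrect, cols = GPT-4o.
    table = np.array([
        [np.sum(o1_correct & gpt4o_correct), np.sum(o1_correct & ~gpt4o_correct)],
        [np.sum(~o1_correct & gpt4o_correct), np.sum(~o1_correct & ~gpt4o_correct)],
    ])

    # McNemar's test uses only the discordant cells (questions where one
    # model answered correctly and the other did not).
    result = mcnemar(table, exact=True)
    print(f"McNemar p-value: {result.pvalue:.4g}")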

MAIN OUTCOME MEASURES

Accuracy of responses for each model and human test takers, stratified by subspecialty and cognitive complexity.

RESULTS

OpenAI o1 achieved the highest overall accuracy (423/500, 84.6%), significantly outperforming GPT-4o (331/500, 66.2%; P < 0.001) and Gemini (301/500, 60.2%; P < 0.001). o1 demonstrated superior performance on questions from both banks (228/250, 91.2% and 195/250, 78.0%) compared with GPT-4o (183/250, 73.2% and 148/250, 59.2%) and Gemini (163/250, 65.2% and 137/250, 54.8%). On questions from one of the banks, human performance (64.5%) was lower than that of Gemini 1.5 Flash (65.2%), GPT-4o (73.2%), and OpenAI o1 (91.2%) (P < 0.001). OpenAI o1 outperformed the other models in each of the nine ophthalmic subfields and at all three cognitive complexity levels.
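
As an arithmetic check on the reported overall accuracies, the short sketch below recomputes each percentage from the counts given in the abstract and adds Wilson 95% confidence intervals; the intervals are an illustrative addition and do not appear in the abstract.

    # Recompute reported accuracies from the abstract's counts; the Wilson
    # 95% CIs are our addition for illustration, not reported in the paper.
    from statsmodels.stats.proportion import proportion_confint

    counts = {"OpenAI o1": 423, "GPT-4o": 331, "Gemini 1.5 Flash": 301}
    n = 500

    for model, correct in counts.items():
        lo, hi = proportion_confint(correct, n, alpha=0.05, method="wilson")
        print(f"{model}: {correct}/{n} = {correct/n:.1%} "
              f"(95% CI {lo:.1%} to {hi:.1%})")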

CONCLUSIONS

OpenAI o1 outperformed GPT-4o, Gemini, and human test takers in answering ophthalmology board-style questions from two question banks and across three complexity levels. These findings highlight advances in AI technology and OpenAI o1's growing potential as an adjunct in ophthalmic education and care.

FINANCIAL DISCLOSURES

The author(s) have no proprietary or commercial interest in any materials discussed in this article.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be15/12273424/ed1e95602e76/gr1.jpg
