Suppr 超能文献


OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board-Style Questions.

Authors

Shean Ryan, Shah Tathya, Sobhani Sina, Tang Alan, Setayesh Ali, Bolo Kyle, Nguyen Van, Xu Benjamin

Affiliations

Keck School of Medicine, University of Southern California, Los Angeles, California.

Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, California.

Publication

Ophthalmol Sci. 2025 Jun 6;5(6):100844. doi: 10.1016/j.xops.2025.100844. eCollection 2025 Nov-Dec.

DOI:10.1016/j.xops.2025.100844
PMID:40689255
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12273424/
Abstract

PURPOSE

To evaluate and compare the performance of human test takers and three artificial intelligence (AI) models (OpenAI o1, ChatGPT-4o, and Gemini 1.5 Flash) on ophthalmology board-style questions, focusing on overall accuracy and on performance stratified by ophthalmic subspecialty and cognitive complexity level.

DESIGN

A cross-sectional study.

SUBJECTS

Five hundred questions sourced from two question banks.

METHODS

Three large language models interpreted the questions using standardized prompting procedures. Subanalysis was performed, stratifying the questions by subspecialty and complexity defined by the Buckwalter taxonomic schema. Statistical analysis, including the analysis of variance and McNemar test, was conducted to assess performance differences.
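The paired comparison described above can be sketched with a continuity-corrected McNemar test applied to two models' per-question correctness on the same question set. This is a minimal illustration of the test, not the study's actual code; the function name and the example counts are hypothetical.

```python
import math

def mcnemar_test(correct_a, correct_b):
    """Continuity-corrected McNemar chi-square test on two paired
    correctness vectors (True = question answered correctly)."""
    # Discordant pairs: b = A right / B wrong, c = A wrong / B right
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0  # models never disagreed; nothing to test
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X > chi2) = erfc(sqrt(chi2/2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical correctness vectors for two models on the same 100 questions:
# 10 questions only model A gets right, 2 only model B gets right.
model_a = [True] * 10 + [False] * 2 + [True] * 50 + [False] * 38
model_b = [False] * 10 + [True] * 2 + [True] * 50 + [False] * 38
chi2, p = mcnemar_test(model_a, model_b)
```

McNemar's test is the natural choice here because both models answer the identical questions, so only the discordant pairs carry information about which model is stronger.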

MAIN OUTCOME MEASURES

Accuracy of responses for each model and human test takers, stratified by subspecialty and cognitive complexity.

RESULTS

OpenAI o1 achieved the highest overall accuracy (423/500, 84.6%), significantly outperforming GPT-4o (331/500, 66.2%; P < 0.001) and Gemini (301/500, 60.2%; P < 0.001). o1 demonstrated superior performance on both question banks (228/250, 91.2% and 195/250, 78.0%) compared with GPT-4o (183/250, 73.2% and 148/250, 59.2%) and Gemini (163/250, 65.2% and 137/250, 54.8%). On the first bank's questions, human performance (64.5%) was lower than that of Gemini 1.5 Flash (65.2%), GPT-4o (73.2%), and OpenAI o1 (91.2%) (P < 0.001). OpenAI o1 outperformed the other models in each of the nine ophthalmic subfields and at all three cognitive complexity levels.
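Reported proportions such as o1's 423/500 can be given a rough uncertainty band with a Wilson score interval. This is an illustrative aside under standard binomial assumptions, not an analysis reported in the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion
    (z = 1.96 gives an approximate 95% interval)."""
    phat = successes / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# o1's overall accuracy from the abstract: 423/500 (84.6%)
lo, hi = wilson_interval(423, 500)
```

With n = 500 the interval is a few percentage points wide, which is narrow enough that the gaps to GPT-4o (66.2%) and Gemini (60.2%) comfortably exceed the sampling noise.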

CONCLUSIONS

OpenAI o1 outperformed GPT-4o, Gemini, and human test takers in answering ophthalmology board-style questions from two question banks and across three complexity levels. These findings highlight advances in AI technology and OpenAI o1's growing potential as an adjunct in ophthalmic education and care.

FINANCIAL DISCLOSURES

The author(s) have no proprietary or commercial interest in any materials discussed in this article.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be15/12273424/8c020f85f746/gr3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be15/12273424/ed1e95602e76/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/be15/12273424/db57e1f097a2/gr2.jpg

Similar Articles

1
OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board-Style Questions.
Ophthalmol Sci. 2025 Jun 6;5(6):100844. doi: 10.1016/j.xops.2025.100844. eCollection 2025 Nov-Dec.
2
Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study.
JMIR Form Res. 2024 Dec 17;8:e57592. doi: 10.2196/57592.
3
Assessing the performance of Microsoft Copilot, GPT-4 and Google Gemini in ophthalmology.
Can J Ophthalmol. 2025 Feb 4. doi: 10.1016/j.jcjo.2025.01.001.
4
Performance analysis of large language models Chatgpt-4o, OpenAI O1, and OpenAI O3 mini in clinical treatment of pneumonia: a comparative study.
Clin Exp Med. 2025 Jun 20;25(1):213. doi: 10.1007/s10238-025-01743-7.
5
A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAi o1 Pro, and Grok 3 performance on ophthalmology board-style questions.
Sci Rep. 2025 Jul 2;15(1):23101. doi: 10.1038/s41598-025-08601-2.
6
Evaluating the Performance of Reasoning Large Language Models on Japanese Radiology Board Examination Questions.
Acad Radiol. 2025 May 17. doi: 10.1016/j.acra.2025.04.060.
7
Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study.
J Med Internet Res. 2025 May 20;27:e69910. doi: 10.2196/69910.
8
Thyroid Eye Disease and Artificial Intelligence: A Comparative Study of ChatGPT-3.5, ChatGPT-4o, and Gemini in Patient Information Delivery.
Ophthalmic Plast Reconstr Surg. 2024 Dec 24. doi: 10.1097/IOP.0000000000002882.
9
Performance of 7 Artificial Intelligence Chatbots on Board-style Endodontic Questions.
J Endod. 2025 Jun 26. doi: 10.1016/j.joen.2025.06.014.
10
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.
J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
