ChatGPT-4o and OpenAI-o1: A Comparative Analysis of Its Accuracy in Refractive Surgery.

Authors

Wallerstein Avi, Ramnawaz Taanvee, Gauvin Mathieu

Affiliations

Department of Ophthalmology and Visual Sciences, McGill University, Montreal, QC H4A 0A4, Canada.

LASIK MD, Montreal, QC H3B 4W8, Canada.

Publication information

J Clin Med. 2025 Jul 22;14(15):5175. doi: 10.3390/jcm14155175.

DOI:10.3390/jcm14155175
PMID:40806797
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12347465/
Abstract

Background/Objectives: To assess the accuracy of ChatGPT-4o and OpenAI-o1 in answering refractive surgery questions from the AAO BCSC Self-Assessment Program and to evaluate whether their performance could meaningfully support clinical decision making, we compared the models with 1983 ophthalmology residents and clinicians. Methods: A randomized, questionnaire-based study was conducted with 228 text-only questions from the Refractive Surgery section of the BCSC Self-Assessment Program. Each model received the prompt, "Please provide an answer to the following questions." Accuracy was measured as the proportion of correct answers and reported with 95% confidence intervals. Differences between groups were assessed with the chi-squared test for independence and pairwise comparisons. Results: OpenAI-o1 achieved the highest score (91.2%, 95% CI 87.6-95.0%), followed by ChatGPT-4o (86.4%, 95% CI 81.9-90.9%) and the average score from 1983 users of the Refractive Surgery section of the BCSC Self-Assessment Program (77%, 95% CI 75.2-78.8%). Both language models significantly outperformed human users. The five-point margin of OpenAI-o1 over ChatGPT-4o did not reach statistical significance (p = 0.1045) but could represent one additional correct decision in twenty clinically relevant scenarios. Conclusions: Both ChatGPT-4o and OpenAI-o1 significantly outperformed BCSC Program users, demonstrating a level of accuracy that could augment medical decision making. Although OpenAI-o1 scored higher than ChatGPT-4o, the difference did not reach statistical significance. These findings indicate that the "advanced reasoning" architecture of OpenAI-o1 offers only incremental gains and underscore the need for prospective studies linking LLM recommendations to concrete clinical outcomes before routine deployment in refractive-surgery practice.
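The statistics named in the abstract (accuracy as a proportion with a 95% confidence interval, and a chi-squared test of independence between the two models) can be illustrated with a short sketch. This is not the authors' analysis code. The correct-answer counts (208 and 197 out of 228) are back-calculated from the reported percentages and are an assumption, as are the choice of a Wald interval and the absence of a continuity correction. The resulting p-value should land near the reported one but need not match it exactly.

```python
import math


def wald_ci(correct, total, z=1.96):
    """Return (proportion, lower, upper) using a Wald interval (assumed method)."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


def chi2_2x2(a_correct, a_wrong, b_correct, b_wrong):
    """Pearson chi-squared test of independence for a 2x2 table.

    No continuity correction is applied (an assumption).
    Returns (chi-squared statistic, p-value) with 1 degree of freedom.
    """
    observed = [[a_correct, a_wrong], [b_correct, b_wrong]]
    row_totals = [sum(row) for row in observed]
    col_totals = [a_correct + b_correct, a_wrong + b_wrong]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    TOTAL = 228          # number of questions, from the abstract
    O1_CORRECT = 208     # assumed: round(0.912 * 228)
    GPT4O_CORRECT = 197  # assumed: round(0.864 * 228)

    for name, correct in [("OpenAI-o1", O1_CORRECT), ("ChatGPT-4o", GPT4O_CORRECT)]:
        p, lo, hi = wald_ci(correct, TOTAL)
        print(f"{name}: {p:.1%} (95% CI {lo:.1%} to {hi:.1%})")

    chi2, p_value = chi2_2x2(
        O1_CORRECT, TOTAL - O1_CORRECT,
        GPT4O_CORRECT, TOTAL - GPT4O_CORRECT,
    )
    print(f"chi-squared = {chi2:.3f}, p = {p_value:.4f}")
```

With only 228 questions per model, a gap of roughly five percentage points corresponds to about eleven questions, which is why the pairwise comparison between the two models falls short of conventional significance.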

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d43/12347465/86b7020dd776/jcm-14-05175-g001.jpg

Similar articles

1
ChatGPT-4o and OpenAI-o1: A Comparative Analysis of Its Accuracy in Refractive Surgery.
J Clin Med. 2025 Jul 22;14(15):5175. doi: 10.3390/jcm14155175.
2
Performance analysis of large language models Chatgpt-4o, OpenAI O1, and OpenAI O3 mini in clinical treatment of pneumonia: a comparative study.
Clin Exp Med. 2025 Jun 20;25(1):213. doi: 10.1007/s10238-025-01743-7.
3
OpenAI o1 Large Language Model Outperforms GPT-4o, Gemini 1.5 Flash, and Human Test Takers on Ophthalmology Board-Style Questions.
Ophthalmol Sci. 2025 Jun 6;5(6):100844. doi: 10.1016/j.xops.2025.100844. eCollection 2025 Nov-Dec.
4
Ophthalmological Question Answering and Reasoning Using OpenAI o1 vs Other Large Language Models.
JAMA Ophthalmol. 2025 Jul 31. doi: 10.1001/jamaophthalmol.2025.2413.
5
ChatGPT-4o Compared With Human Researchers in Writing Plain-Language Summaries for Cochrane Reviews: A Blinded, Randomized Non-Inferiority Controlled Trial.
Cochrane Evid Synth Methods. 2025 Jul 28;3(4):e70037. doi: 10.1002/cesm.70037. eCollection 2025 Jul.
6
Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China's Rare Disease Catalog: Comparative Study.
J Med Internet Res. 2025 Jun 18;27:e69929. doi: 10.2196/69929.
7
One Year On: Assessing Progress of Multimodal Large Language Model Performance on RSNA 2024 Case of the Day Questions.
Radiology. 2025 Aug;316(2):e250617. doi: 10.1148/radiol.250617.
8
[Preliminary exploration of the applications of five large language models in the field of oral auxiliary diagnosis, treatment and health consultation].
Zhonghua Kou Qiang Yi Xue Za Zhi. 2025 Jul 30;60(8):871-878. doi: 10.3760/cma.j.cn112144-20241107-00418.
9
Artificial Intelligence in Hand and Upper Extremity Surgery Education: Accuracy and Validity of ChatGPT-4o Versus UpToDate as a Learning Tool for Trainees.
Eplasty. 2025 May 14;25:e17. eCollection 2025.
10
Evaluation of error detection and treatment recommendations in nucleic acid test reports using ChatGPT models.
Clin Chem Lab Med. 2025 Apr 21. doi: 10.1515/cclm-2025-0089.
