
ChatGPT-4o and OpenAI-o1: A Comparative Analysis of Its Accuracy in Refractive Surgery.

Author Information

Wallerstein Avi, Ramnawaz Taanvee, Gauvin Mathieu

Affiliations

Department of Ophthalmology and Visual Sciences, McGill University, Montreal, QC H4A 0A4, Canada.

LASIK MD, Montreal, QC H3B 4W8, Canada.

Publication Information

J Clin Med. 2025 Jul 22;14(15):5175. doi: 10.3390/jcm14155175.

Abstract

Background/Objectives: To assess the accuracy of ChatGPT-4o and OpenAI-o1 in answering refractive surgery questions from the AAO BCSC Self-Assessment Program and to evaluate whether their performance could meaningfully support clinical decision making, we compared the models with 1983 ophthalmology residents and clinicians. Methods: A randomized, questionnaire-based study was conducted with 228 text-only questions from the Refractive Surgery section of the BCSC Self-Assessment Program. Each model received the prompt, "Please provide an answer to the following questions." Accuracy was measured as the proportion of correct answers and reported with 95 percent confidence intervals. Differences between groups were assessed with the chi-squared test for independence and pairwise comparisons. Results: OpenAI-o1 achieved the highest score (91.2%, 95% CI 87.6-95.0%), followed by ChatGPT-4o (86.4%, 95% CI 81.9-90.9%) and the average score from 1983 users of the Refractive Surgery section of the BCSC Self-Assessment Program (77%, 95% CI 75.2-78.8%). Both language models significantly outperformed human users. The five-point margin of OpenAI-o1 over ChatGPT-4o did not reach statistical significance (p = 0.1045) but could represent one additional correct decision in twenty clinically relevant scenarios. Conclusions: Both ChatGPT-4o and OpenAI-o1 significantly outperformed BCSC Program users, demonstrating a level of accuracy that could augment medical decision making. Although OpenAI-o1 scored higher than ChatGPT-4o, the difference did not reach statistical significance. These findings indicate that the "advanced reasoning" architecture of OpenAI-o1 offers only incremental gains, underscoring the need for prospective studies linking LLM recommendations to concrete clinical outcomes before routine deployment in refractive-surgery practice.
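As a rough illustration of the statistics reported above, the model comparison can be sketched in Python. The correct-answer counts (208/228 for OpenAI-o1, 197/228 for ChatGPT-4o) are inferred here from the reported percentages, and the normal-approximation confidence interval and uncorrected Pearson chi-squared test are assumptions for illustration; the paper's exact procedure may differ slightly (which would explain small deviations from its reported p = 0.1045 and interval bounds).

```python
import math

def prop_ci(k, n, z=1.96):
    """Proportion of correct answers with a normal-approximation 95% CI."""
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for a 2x2 table (1 df, no continuity
    correction). Rows are models, columns are correct/incorrect counts."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, the chi-square survival function reduces to erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts inferred from the reported accuracies on 228 questions:
# OpenAI-o1 ~91.2% -> 208 correct; ChatGPT-4o ~86.4% -> 197 correct.
acc_o1, lo_o1, hi_o1 = prop_ci(208, 228)
chi2, p_value = chi2_2x2(208, 228 - 208, 197, 228 - 197)
```

With these inferred counts the test yields p of roughly 0.10, consistent with the non-significant difference the abstract reports.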


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d43/12347465/86b7020dd776/jcm-14-05175-g001.jpg
