
ChatGPT-4o and OpenAI-o1: A Comparative Analysis of Its Accuracy in Refractive Surgery.

Author Information

Wallerstein Avi, Ramnawaz Taanvee, Gauvin Mathieu

Affiliations

Department of Ophthalmology and Visual Sciences, McGill University, Montreal, QC H4A 0A4, Canada.

LASIK MD, Montreal, QC H3B 4W8, Canada.

Publication Information

J Clin Med. 2025 Jul 22;14(15):5175. doi: 10.3390/jcm14155175.

Abstract

Background/Objectives: To assess the accuracy of ChatGPT-4o and OpenAI-o1 in answering refractive surgery questions from the AAO BCSC Self-Assessment Program and to evaluate whether their performance could meaningfully support clinical decision making, we compared the models with 1983 ophthalmology residents and clinicians. Methods: A randomized, questionnaire-based study was conducted with 228 text-only questions from the Refractive Surgery section of the BCSC Self-Assessment Program. Each model received the prompt, "Please provide an answer to the following questions." Accuracy was measured as the proportion of correct answers and reported with 95 percent confidence intervals. Differences between groups were assessed with the chi-squared test for independence and pairwise comparisons. Results: OpenAI-o1 achieved the highest score (91.2%, 95% CI 87.6-95.0%), followed by ChatGPT-4o (86.4%, 95% CI 81.9-90.9%) and the average score from 1983 users of the Refractive Surgery section of the BCSC Self-Assessment Program (77%, 95% CI 75.2-78.8%). Both language models significantly outperformed human users. The five-point margin of OpenAI-o1 over ChatGPT-4o did not reach statistical significance (p = 0.1045) but could represent one additional correct decision in twenty clinically relevant scenarios. Conclusions: Both ChatGPT-4o and OpenAI-o1 significantly outperformed BCSC Program users, demonstrating a level of accuracy that could augment medical decision making. Although OpenAI-o1 scored higher than ChatGPT-4o, the difference did not reach statistical significance. These findings indicate that the "advanced reasoning" architecture of OpenAI-o1 offers only incremental gains, underscoring the need for prospective studies linking LLM recommendations to concrete clinical outcomes before routine deployment in refractive-surgery practice.
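As a rough illustration of the statistics reported above, the model comparison can be sketched in Python. The correct-answer counts (208/228 for OpenAI-o1, 197/228 for ChatGPT-4o) are inferred here from the reported percentages, and the normal-approximation confidence interval and uncorrected Pearson chi-squared test are assumptions for illustration; the paper's exact procedure may differ slightly (which would explain small deviations from its reported p = 0.1045 and interval bounds).

```python
import math

def prop_ci(k, n, z=1.96):
    """Proportion of correct answers with a normal-approximation 95% CI."""
    p = k / n
    se = math.sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for a 2x2 table (1 df, no continuity
    correction). Rows are models, columns are correct/incorrect counts."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 df, the chi-square survival function reduces to erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts inferred from the reported accuracies on 228 questions:
# OpenAI-o1 ~91.2% -> 208 correct; ChatGPT-4o ~86.4% -> 197 correct.
acc_o1, lo_o1, hi_o1 = prop_ci(208, 228)
chi2, p_value = chi2_2x2(208, 228 - 208, 197, 228 - 197)
```

With these inferred counts the test yields p of roughly 0.10, consistent with the non-significant difference the abstract reports.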


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2d43/12347465/86b7020dd776/jcm-14-05175-g001.jpg
