Wallerstein Avi, Ramnawaz Taanvee, Gauvin Mathieu
Department of Ophthalmology and Visual Sciences, McGill University, Montreal, QC H4A 0A4, Canada.
LASIK MD, Montreal, QC H3B 4W8, Canada.
J Clin Med. 2025 Jul 22;14(15):5175. doi: 10.3390/jcm14155175.
Background/Objectives: To assess the accuracy of ChatGPT-4o and OpenAI-o1 in answering refractive surgery questions from the AAO BCSC Self-Assessment Program and to evaluate whether their performance could meaningfully support clinical decision making, we compared the models with 1983 ophthalmology residents and clinicians. Methods: A randomized, questionnaire-based study was conducted with 228 text-only questions from the Refractive Surgery section of the BCSC Self-Assessment Program. Each model received the prompt, "Please provide an answer to the following questions." Accuracy was measured as the proportion of correct answers and reported with 95% confidence intervals. Differences between groups were assessed with the chi-squared test for independence and pairwise comparisons. Results: OpenAI-o1 achieved the highest score (91.2%, 95% CI 87.6-95.0%), followed by ChatGPT-4o (86.4%, 95% CI 81.9-90.9%) and the average score from 1983 users of the Refractive Surgery section of the BCSC Self-Assessment Program (77%, 95% CI 75.2-78.8%). Both language models significantly outperformed human users. The five-point margin of OpenAI-o1 over ChatGPT-4o did not reach statistical significance (p = 0.1045) but could represent one additional correct decision in twenty clinically relevant scenarios. Conclusions: Both ChatGPT-4o and OpenAI-o1 significantly outperformed BCSC Program users, demonstrating a level of accuracy that could augment medical decision making. Although OpenAI-o1 scored higher than ChatGPT-4o, the difference did not reach statistical significance. These findings indicate that the "advanced reasoning" architecture of OpenAI-o1 offers only incremental gains and underscore the need for prospective studies linking LLM recommendations to concrete clinical outcomes before routine deployment in refractive-surgery practice.
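The sketch below (not the authors' code) illustrates the style of analysis the abstract describes: accuracy with 95% confidence intervals and a pairwise chi-squared test of independence between the two models. Correct-answer counts are reconstructed from the reported percentages over 228 questions, so they are approximations; the normal-approximation (Wald) interval is assumed because it reproduces the reported CIs, and the test is run without continuity correction because that choice lands near the reported p = 0.1045.

```python
# Minimal sketch, assuming counts back-calculated from the reported percentages.
from scipy.stats import chi2_contingency, norm

def wald_ci(correct, n, alpha=0.05):
    """Normal-approximation (Wald) 95% confidence interval for a proportion."""
    z = norm.ppf(1 - alpha / 2)
    p = correct / n
    half = z * (p * (1 - p) / n) ** 0.5
    return p - half, p + half

n = 228                           # text-only BCSC refractive-surgery questions
o1_correct = round(0.912 * n)     # ~208 correct for OpenAI-o1 (assumption)
gpt4o_correct = round(0.864 * n)  # ~197 correct for ChatGPT-4o (assumption)

for name, k in [("OpenAI-o1", o1_correct), ("ChatGPT-4o", gpt4o_correct)]:
    lo, hi = wald_ci(k, n)
    print(f"{name}: {k / n:.1%} (95% CI {lo:.1%}-{hi:.1%})")

# Pairwise 2x2 chi-squared test: correct vs. incorrect for the two models.
table = [[o1_correct, n - o1_correct],
         [gpt4o_correct, n - gpt4o_correct]]
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")  # lands near the reported p = 0.1045
```

With these reconstructed counts the intervals come out to roughly 87.6-94.9% for OpenAI-o1 and 81.9-90.9% for ChatGPT-4o, matching the abstract; the five-point margin corresponds to about one additional correct answer per twenty questions, as the authors note.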