Pushpanathan Krithi, Zou Minjie, Srinivasan Sahana, Wong Wendy Meihua, Mangunkusumo Erlangga Ariadarma, Thomas George Naveen, Lai Yien, Sun Chen-Hsin, Lam Janice Sing Harn, Tan Marcus Chun Jin, Lin Hazel Anne Hui'En, Ma Weizhi, Koh Victor Teck Chang, Chen David Ziyou, Tham Yih-Chung
Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore.
Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore.
Ophthalmol Sci. 2025 Feb 22;5(4):100745. doi: 10.1016/j.xops.2025.100745. eCollection 2025 Jul-Aug.
The newly launched OpenAI o1 is said to offer improved reasoning, potentially providing higher-quality responses to eye care queries. However, its performance on such queries has not yet been assessed. We evaluated o1, ChatGPT-4o, and ChatGPT-4 on ophthalmology-related queries, focusing on correctness, completeness, and readability.
Cross-sectional study.
Sixteen queries that ChatGPT-4 had answered suboptimally in prior studies were used, covering 3 subtopics: myopia (6 questions), ocular symptoms (4 questions), and retinal conditions (6 questions).
For each subtopic, 3 attending-level ophthalmologists, masked to the model sources, evaluated the responses for correctness, completeness, and readability (each on a 5-point scale).
Mean summed scores for each model on correctness, completeness, and readability; each of the 3 graders rated on a 5-point scale, so the summed score per metric has a maximum of 15.
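A minimal sketch of this scoring arithmetic, using hypothetical ratings (the per-grader data are not published in this abstract): each of 3 graders scores a response from 1 to 5, scores are summed per query (maximum 15), and the summed scores are averaged across a model's responses.

```python
# Hypothetical per-grader correctness ratings for one model (3 graders per query).
ratings = {
    "Q1": [5, 4, 4],
    "Q2": [3, 4, 5],
    "Q3": [4, 4, 4],
}

summed = {q: sum(r) for q, r in ratings.items()}  # per-query summed score, max 15
mean_summed = sum(summed.values()) / len(summed)  # model-level mean summed score

print(summed)       # {'Q1': 13, 'Q2': 12, 'Q3': 12}
print(mean_summed)  # 12.33...
```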
o1 scored highest in correctness (12.6) and readability (14.2), outperforming ChatGPT-4, which scored 10.3 (P = 0.010) and 12.4 (P < 0.001), respectively. No significant difference was found between o1 and ChatGPT-4o. When stratified by subtopic, o1 consistently demonstrated superior correctness and readability. In completeness, ChatGPT-4o achieved the highest score (12.4), followed by o1 (10.8), though the difference was not statistically significant. o1 showed a notable limitation in completeness for ocular symptom queries, scoring 5.5 out of 15.
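The abstract does not name the statistical test behind these P values; a paired Wilcoxon signed-rank test on per-query summed scores is one plausible approach for ordinal paired data. The sketch below uses that assumption, with hypothetical score vectors for the 16 queries.

```python
# Assumed analysis: Wilcoxon signed-rank test comparing paired per-query
# summed scores between two models. Both the test choice and the score
# vectors are illustrative, not taken from the study.
from scipy.stats import wilcoxon

o1_scores   = [13, 12, 14, 11, 13, 12, 14, 13, 12, 13, 14, 12, 13, 11, 12, 13]
gpt4_scores = [10, 11, 10,  9, 11, 10, 12, 10,  9, 11, 10, 10, 11,  9, 10, 11]

stat, p = wilcoxon(o1_scores, gpt4_scores)  # paired comparison across 16 queries
print(f"W = {stat}, p = {p:.4f}")
```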
Although o1 is marketed as offering improved reasoning capabilities, its performance on eye care queries does not differ significantly from that of its predecessor, ChatGPT-4o. It does, however, surpass ChatGPT-4, particularly in correctness and readability.
Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.