
Can OpenAI's New o1 Model Outperform Its Predecessors in Common Eye Care Queries?

Authors

Pushpanathan Krithi, Zou Minjie, Srinivasan Sahana, Wong Wendy Meihua, Mangunkusumo Erlangga Ariadarma, Thomas George Naveen, Lai Yien, Sun Chen-Hsin, Lam Janice Sing Harn, Tan Marcus Chun Jin, Lin Hazel Anne Hui'En, Ma Weizhi, Koh Victor Teck Chang, Chen David Ziyou, Tham Yih-Chung

Affiliations

Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore.

Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore.

Publication Information

Ophthalmol Sci. 2025 Feb 22;5(4):100745. doi: 10.1016/j.xops.2025.100745. eCollection 2025 Jul-Aug.

Abstract

OBJECTIVE

The newly launched OpenAI o1 is said to offer improved reasoning, potentially providing higher quality responses to eye care queries. However, its performance remains unassessed. We evaluated the performance of o1, ChatGPT-4o, and ChatGPT-4 in addressing ophthalmic-related queries, focusing on correctness, completeness, and readability.

DESIGN

Cross-sectional study.

SUBJECTS

Sixteen queries, previously identified as suboptimally responded to by ChatGPT-4 from prior studies, were used, covering 3 subtopics: myopia (6 questions), ocular symptoms (4 questions), and retinal conditions (6 questions).

METHODS

For each subtopic, 3 attending-level ophthalmologists, masked to the model sources, evaluated the responses based on correctness, completeness, and readability (on a 5-point scale for each metric).

MAIN OUTCOME MEASURES

Mean summed scores of each model for correctness, completeness, and readability, rated on a 5-point scale (maximum score: 15).

RESULTS

O1 scored highest in correctness (12.6) and readability (14.2), outperforming ChatGPT-4, which scored 10.3 (P = 0.010) and 12.4 (P < 0.001), respectively. No significant difference was found between o1 and ChatGPT-4o. When stratified by subtopics, o1 consistently demonstrated superior correctness and readability. In completeness, ChatGPT-4o achieved the highest score of 12.4, followed by o1 (10.8), though the difference was not statistically significant. o1 showed notable limitations in completeness for ocular symptom queries, scoring 5.5 out of 15.

CONCLUSIONS

While o1 is marketed as offering improved reasoning capabilities, its performance in addressing eye care queries does not significantly differ from its predecessor, ChatGPT-4o. Nevertheless, it surpasses ChatGPT-4, particularly in correctness and readability.

FINANCIAL DISCLOSURES

Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
