Chetla Nitin, Samayamanthula Sai S, Chang Joseph He, Leigh Arnold Y, Akosman Sinan, Tandon Mihir, Hage Tamer R, Cusick Michael
University of Virginia School of Medicine, Charlottesville, VA, USA.
University of Passau, Passau, Germany.
Clin Ophthalmol. 2025 Aug 31;19:3103-3112. doi: 10.2147/OPTH.S517238. eCollection 2025.
Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. Despite the importance of early DR detection, only 60% of patients with diabetes receive recommended annual screenings due to limited eye care provider capacity. FDA-approved AI systems were developed to meet the growing demand for DR screening; however, high costs and specialized equipment limit their accessibility. More accessible and equally accurate AI systems need to be evaluated to address this disparity. This study evaluated the diagnostic accuracy of ChatGPT-4 Omni (GPT-4o) in classifying DR from color fundus photographs (CFPs) to assess its potential as a low-cost alternative screening tool.
We utilized the publicly available EyePACS DR detection competition dataset from Kaggle, which includes 2,500 CFPs representing no DR, mild DR, moderate DR, severe DR, and proliferative DR. Each image was presented to GPT-4o with 1 of 8 prompts designed to enhance the model's accuracy. The results were analyzed through confusion matrices, and metrics such as accuracy, precision, sensitivity, specificity, and F1 scores were calculated to evaluate performance.
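The abstract does not publish the study's confusion matrices, but the per-class metrics it reports (accuracy, precision, sensitivity, specificity, F1) can all be derived from one. A minimal sketch, using a generic one-vs-rest decomposition of a square confusion matrix (the matrix values in the usage note below are illustrative, not the study's data):

```python
import numpy as np

def per_class_metrics(cm):
    """Derive accuracy, precision, sensitivity (recall), specificity,
    and F1 for each class of a square confusion matrix
    (rows = true labels, columns = predicted labels)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    metrics = {}
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fn = cm[k].sum() - tp        # true class k, predicted as another class
        fp = cm[:, k].sum() - tp     # predicted class k, truly another class
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * sensitivity / (precision + sensitivity)
              if precision + sensitivity else 0.0)
        metrics[k] = {"accuracy": (tp + tn) / total,
                      "precision": precision,
                      "sensitivity": sensitivity,
                      "specificity": specificity,
                      "f1": f1}
    return metrics
```

For example, `per_class_metrics([[8, 2], [3, 7]])` yields, for class 0, accuracy 0.75, sensitivity 0.80, specificity 0.70, and F1 ≈ 0.76. For the study's 5-level task, the same function would be applied to a 5×5 matrix over the DR grades 0-4.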
In Prompts 1-3, GPT-4o showed a strong bias toward classifying images as no DR, with an average accuracy of 51.0%, while accuracy for the other stages ranged from 70% to 80%. GPT-4o frequently misclassified images, particularly confusing adjacent DR levels. It performed best in detecting proliferative DR (Level 4), achieving an F1 score above 0.3 and accuracy exceeding 80%. In binary classification tasks (Prompts 4.1-4.4), GPT-4o's performance improved, though it still had difficulty distinguishing mild DR (49.8% accuracy). Compared with FDA-approved AI systems, GPT-4o's sensitivity (47.7%) and specificity (73.8%) were significantly lower.
While GPT-4o shows promise in identifying severe DR, its limitations in distinguishing early stages highlight the need for further refinement before clinical use in DR screening. Unlike traditional CNN-based tools such as IDx-DR, GPT-4o is a multimodal foundation model with a fundamentally different architecture and training process, which may contribute to its diagnostic limitations. GPT-4o and other LLMs are not designed to learn important DR features such as microaneurysms or hemorrhages from pixel data, which may explain why they struggle to detect DR compared with CNN models.