Chetla Nitin, Samayamanthula Sai S, Chang Joseph He, Leigh Arnold Y, Akosman Sinan, Tandon Mihir, Hage Tamer R, Cusick Michael
University of Virginia School of Medicine, Charlottesville, VA, USA.
University of Passau, Passau, Germany.
Clin Ophthalmol. 2025 Aug 31;19:3103-3112. doi: 10.2147/OPTH.S517238. eCollection 2025.
Diabetic retinopathy (DR) is a leading cause of vision loss in working-age adults. Despite the importance of early DR detection, only 60% of patients with diabetes receive recommended annual screenings due to limited eye care provider capacity. FDA-approved AI systems were developed to meet the growing demand for DR screening; however, high costs and specialized equipment limit their accessibility. More accessible and equally accurate AI systems need to be evaluated to address this disparity. This study evaluated the diagnostic accuracy of ChatGPT-4 Omni (GPT-4o) in classifying DR from color fundus photographs (CFPs) to assess its potential as a low-cost alternative screening tool.
We utilized the publicly available EyePACS DR detection competition dataset from Kaggle, which includes 2,500 CFPs representing no DR, mild DR, moderate DR, severe DR, and proliferative DR. Each image was presented to GPT-4o with 1 of 8 prompts designed to enhance the model's accuracy. The results were analyzed through confusion matrices, and metrics such as accuracy, precision, sensitivity, specificity, and F1 scores were calculated to evaluate performance.
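The abstract does not publish the study's confusion matrices, but the per-class metrics it reports (accuracy, precision, sensitivity, specificity, F1) can all be derived from one. A minimal sketch, using a generic one-vs-rest decomposition of a square confusion matrix (the matrix values in the usage note below are illustrative, not the study's data):

```python
import numpy as np

def per_class_metrics(cm):
    """Derive accuracy, precision, sensitivity (recall), specificity,
    and F1 for each class of a square confusion matrix
    (rows = true labels, columns = predicted labels)."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    metrics = {}
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fn = cm[k].sum() - tp        # true class k, predicted as another class
        fp = cm[:, k].sum() - tp     # predicted class k, truly another class
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * sensitivity / (precision + sensitivity)
              if precision + sensitivity else 0.0)
        metrics[k] = {"accuracy": (tp + tn) / total,
                      "precision": precision,
                      "sensitivity": sensitivity,
                      "specificity": specificity,
                      "f1": f1}
    return metrics
```

For example, `per_class_metrics([[8, 2], [3, 7]])` yields, for class 0, accuracy 0.75, sensitivity 0.80, specificity 0.70, and F1 ≈ 0.76. For the study's 5-level task, the same function would be applied to a 5×5 matrix over the DR grades 0-4.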
In Prompts 1-3, GPT-4o showed a strong bias toward classifying images as no DR, with an average accuracy of 51.0%, while accuracy for the other stages ranged from 70% to 80%. GPT-4o frequently misclassified images, particularly confusing adjacent DR levels. It performed best in detecting proliferative DR (Level 4), achieving an F1 score above 0.3 and accuracy exceeding 80%. In binary classification tasks (Prompts 4.1-4.4), GPT-4o's performance improved, though it still had difficulty distinguishing mild DR (49.8% accuracy). Compared with FDA-approved AI systems, GPT-4o's sensitivity (47.7%) and specificity (73.8%) were significantly lower.
While GPT-4o shows promise in identifying severe DR, its limitations in distinguishing early stages highlight the need for further refinement before clinical use in DR screening. Unlike traditional CNN-based tools such as IDx-DR, GPT-4o is a multimodal foundation model with a fundamentally different architecture and training process, which may contribute to its diagnostic limitations. GPT-4o and other LLMs are not designed to learn important DR features such as microaneurysms or hemorrhages from pixel data, which may explain why they struggle to detect DR compared with CNN models.