Jalili Jalil, Jiravarnsirikul Anuwat, Bowd Christopher, Chuter Benton, Belghith Akram, Goldbaum Michael H, Baxter Sally L, Weinreb Robert N, Zangwill Linda M, Christopher Mark
Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California.
Hamilton Glaucoma Center, Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California.
Ophthalmol Sci. 2024 Nov 29;5(2):100667. doi: 10.1016/j.xops.2024.100667. eCollection 2025 Mar-Apr.
To assess the diagnostic accuracy of GPT-4V (OpenAI) and its ability to identify glaucoma-related features, compared with expert evaluations.
Evaluation of multimodal large language models for reviewing fundus images in glaucoma.
A total of 300 fundus images from 3 public datasets (ACRIMA, ORIGA, and RIM-One v3) that included 139 glaucomatous and 161 nonglaucomatous cases were analyzed.
Preprocessing ensured each image was centered on the optic disc. GPT-4's vision-preview model (GPT-4V) assessed each image for various glaucoma-related criteria: image quality, image gradability, cup-to-disc ratio, peripapillary atrophy, disc hemorrhages, rim thinning (by quadrant and clock hour), glaucoma status, and estimated probability of glaucoma. Each image was analyzed twice by GPT-4V to evaluate consistency in its predictions. Two expert graders independently evaluated the same images using identical criteria. Comparisons between GPT-4V's assessments, expert evaluations, and dataset labels were made to determine accuracy, sensitivity, specificity, and Cohen kappa.
The main parameters measured were the accuracy, sensitivity, specificity, and Cohen kappa of GPT-4V in detecting glaucoma compared with expert evaluations.
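The four outcome measures named above can be computed from paired binary labels (1 = glaucoma, 0 = nonglaucoma). The following is a minimal sketch under stated assumptions, not the authors' analysis code: the function names `binary_metrics` and `cohen_kappa` are hypothetical, and the example labels are toy values, not study data.

```python
from collections import Counter


def binary_metrics(truth, pred):
    """Accuracy, sensitivity, and specificity for 0/1 labels (1 = glaucoma).

    `truth` is the reference (e.g. a dataset label or an expert grade) and
    `pred` is the rating being evaluated. Both names are illustrative.
    """
    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / n,
        # Sensitivity: fraction of true glaucoma cases flagged as glaucoma.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: fraction of nonglaucoma cases flagged as nonglaucoma.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


def cohen_kappa(rater_a, rater_b):
    """Cohen kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled the same.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # Both raters used one identical label throughout.
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy example (NOT study data): 10 images, 4 of them glaucomatous.
reference = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
model = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

print(binary_metrics(reference, model))
print(cohen_kappa(reference, model))
```

Kappa is reported alongside accuracy because it discounts the agreement two raters would reach by chance given how often each assigns a label, which matters when the classes are unevenly represented.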
GPT-4V provided glaucoma assessments for all 300 fundus images across the datasets, although approximately 35% required multiple prompt submissions. Across the ACRIMA, ORIGA, and RIM-ONE datasets, GPT-4V's overall accuracy in glaucoma detection (0.68, 0.70, and 0.81, respectively) was slightly lower than that of expert grader 1 (0.78, 0.80, and 0.88) and expert grader 2 (0.72, 0.78, and 0.87). In glaucoma detection, GPT-4V's agreement varied by dataset and expert grader, with Cohen kappa values ranging from 0.08 to 0.72. In feature detection, GPT-4V demonstrated high consistency (repeatability) in image gradability, with an agreement accuracy of ≥89%, and substantial agreement in rim thinning and cup-to-disc ratio assessments, although kappas were generally lower than expert-to-expert agreement.
GPT-4V shows promise as a tool in glaucoma screening and detection through fundus image analysis, demonstrating generally high agreement with expert evaluations of key diagnostic features, although agreement did vary substantially across datasets.
Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.