Raja Hina, Huang Xiaoqin, Delsoz Mohammad, Madadi Yeganeh, Poursoroush Asma, Munawar Asim, Kahook Malik Y, Yousefi Siamak
Department of Ophthalmology, Hamilton Eye Institute, University of Tennessee Health Science Center, Memphis, Tennessee.
Department of Mathematics and Computer Science, Fisk University, Nashville, Tennessee.
Ophthalmol Sci. 2024 Aug 22;5(1):100599. doi: 10.1016/j.xops.2024.100599. eCollection 2025 Jan-Feb.
To evaluate the capabilities of Chat Generative Pre-Trained Transformer (ChatGPT), a large language model (LLM), for diagnosing glaucoma using the Ocular Hypertension Treatment Study (OHTS) dataset, and to compare the diagnostic capabilities of ChatGPT 3.5 and ChatGPT 4.0.
Prospective data collection study.
A total of 3170 eyes of 1585 subjects from the OHTS were included in this study.
We selected demographic, clinical, ocular, visual field, optic nerve head photograph, and disease history parameters for each participant and developed case reports by converting the tabular data into textual format based on information from both eyes of every subject. We then developed a procedure using the application programming interface (API) of ChatGPT, an LLM-based chatbot, to automatically enter prompts into the chat box and query 2 different generations of ChatGPT (versions 3.5 and 4.0) regarding the underlying diagnosis of each subject. We then evaluated the output responses using several objective metrics.
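For illustration only, the Python sketch below shows how such a pipeline could look, assuming the official OpenAI Python client; the field names, prompt wording, and the build_case_report/query_diagnosis helpers are hypothetical and are not taken from the study.

```python
# Minimal sketch (not the authors' code): flatten one OHTS-style tabular record
# into a textual case report and query the ChatGPT API for a diagnosis.
# All field names and prompt text are hypothetical illustrations.
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def build_case_report(row: dict) -> str:
    """Convert one subject's tabular parameters into a textual case report."""
    return (
        f"Patient: {row['age']}-year-old {row['sex']}.\n"
        f"IOP OD/OS: {row['iop_od']}/{row['iop_os']} mmHg.\n"
        f"Central corneal thickness OD/OS: {row['cct_od']}/{row['cct_os']} um.\n"
        f"Visual field MD OD/OS: {row['md_od']}/{row['md_os']} dB.\n"
        f"Cup-to-disc ratio OD/OS: {row['cdr_od']}/{row['cdr_os']}.\n"
        f"History: {row['history']}."
    )


def query_diagnosis(case_report: str, model: str = "gpt-4") -> str:
    """Ask the model whether the case is consistent with glaucoma."""
    response = client.chat.completions.create(
        model=model,  # e.g., "gpt-3.5-turbo" or "gpt-4"
        messages=[
            {"role": "system",
             "content": "You are an ophthalmology assistant. Answer 'glaucoma' "
                        "or 'no glaucoma' with a brief justification."},
            {"role": "user", "content": case_report},
        ],
        temperature=0,  # deterministic output for reproducible evaluation
    )
    return response.choices[0].message.content


# Example usage with a single illustrative (made-up) record:
example_row = {
    "age": 62, "sex": "female",
    "iop_od": 27, "iop_os": 26, "cct_od": 540, "cct_os": 545,
    "md_od": -3.2, "md_os": -0.8, "cdr_od": 0.7, "cdr_os": 0.5,
    "history": "ocular hypertension, no prior glaucoma treatment",
}
print(query_diagnosis(build_case_report(example_row)))
```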
Area under the receiver operating characteristic curve (AUC), accuracy, specificity, sensitivity, and F1 score.
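As a worked illustration of these outcome measures, the following Python sketch computes them with scikit-learn from hypothetical labels and scores; it is not the authors' evaluation code.

```python
# Minimal sketch (assumed evaluation, not the authors' code): compute the
# reported metrics from binary reference labels and model outputs.
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score,
                             confusion_matrix, f1_score)

# Hypothetical arrays: y_true is the reference diagnosis (1 = glaucoma),
# y_prob a score derived from the model's response, y_pred its binary label.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.4, 0.3, 0.1, 0.8, 0.6])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("AUC:        ", roc_auc_score(y_true, y_prob))
print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Specificity:", tn / (tn + fp))
print("Sensitivity:", tp / (tp + fn))
print("F1 score:   ", f1_score(y_true, y_pred))
```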
Chat Generative Pre-Trained Transformer 3.5 achieved an AUC of 0.74, accuracy of 66%, specificity of 64%, sensitivity of 85%, and F1 score of 0.72. Chat Generative Pre-Trained Transformer 4.0 achieved an AUC of 0.76, accuracy of 87%, specificity of 90%, sensitivity of 61%, and F1 score of 0.92.
The accuracy of ChatGPT 4.0 in diagnosing glaucoma based on input data from the OHTS was promising. The overall accuracy of ChatGPT 4.0 was higher than that of ChatGPT 3.5; however, ChatGPT 3.5 was more sensitive than ChatGPT 4.0. In its current form, ChatGPT may serve as a useful tool for exploring the disease status of ocular hypertensive eyes when specific data are available for analysis. In the future, leveraging LLMs with multimodal capabilities, allowing integration of imaging and diagnostic testing into the analyses, could further enhance diagnostic capabilities and accuracy.
Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.