Balas Michael, Mandelcorn Efrem D, Yan Peng, Ing Edsel B, Crawford Sean A, Arjmand Parnian
Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada.
Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada; Department of Ophthalmology and Vision Sciences, University of Toronto, Toronto, ON, Canada; University Health Network, University of Toronto, Toronto, ON, Canada; Kensington Eye Institute, Toronto, ON, Canada.
Can J Ophthalmol. 2025 Feb;60(1):e117-e123. doi: 10.1016/j.jcjo.2024.06.001. Epub 2024 Aug 1.
To evaluate the performance of an artificial intelligence (AI) large language model, ChatGPT (version 4.0), on questions about common retinal diseases, in accordance with the American Academy of Ophthalmology (AAO) Preferred Practice Pattern (PPP) guidelines.
A cross-sectional survey study design was employed to compare ChatGPT's responses with established clinical guidelines.
A panel of three vitreoretinal specialists evaluated the AI's responses.
To investigate ChatGPT's comprehension of clinical guidelines, we designed 130 questions covering a broad spectrum of topics within 12 AAO PPP domains of retinal disease. These questions were crafted to encompass diagnostic criteria, treatment guidelines, and management strategies, including both medical and surgical aspects of retinal care. A panel of three retinal specialists independently evaluated responses on a Likert scale from 1 to 5 based on their relevance, accuracy, and adherence to AAO PPP guidelines. Response readability was evaluated using Flesch Reading Ease and Flesch-Kincaid grade level scores.
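The two readability metrics named above are standard published formulas. As a minimal illustrative sketch (not the study's actual scoring pipeline, which is not described at this level of detail), the formulas can be computed as follows; the syllable counter here is a naive vowel-group heuristic, whereas validated tools use dictionary-based counting:

```python
import re

def count_syllables(word):
    # Crude heuristic: count vowel groups, subtract a trailing silent "e".
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_scores(text):
    # Split into sentences and words, then apply the published formulas.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    # Flesch Reading Ease: higher = easier (college-level text scores ~30-50).
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid grade level: approximate U.S. school grade required.
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level
```

Short, monosyllabic sentences score high on reading ease and low on grade level; the long, polysyllabic sentences typical of model-generated clinical text score in the college-to-graduate range reported in the Results.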
ChatGPT achieved an overall average score of 4.9/5.0, suggesting high alignment with the AAO PPP guidelines. Scores varied across domains, with the lowest in the surgical management of disease. The responses had a low reading ease score and required a college-to-graduate level of comprehension. Identified errors were related to diagnostic criteria, treatment options, and methodological procedures.
ChatGPT 4.0 demonstrated significant potential in generating guideline-concordant responses, particularly for common medical retinal diseases. However, its performance slightly decreased in surgical retina, highlighting the ongoing need for clinician input, further model refinement, and improved comprehensibility.