Kurapati Sai S, Barnett Derek J, Yaghy Antonio, Sabet Cameron J, Younessi David N, Nguyen Dang, Lin John C, Scott Ingrid U
Department of Ophthalmology, Penn State College of Medicine, Hershey, Pennsylvania, USA.
Ophthalmologica. 2025;248(3):149-159. doi: 10.1159/000544917. Epub 2025 Mar 10.
Generative artificial intelligence (AI) technologies such as GPT-4 can provide health information to patients instantaneously; however, how the readability of these outputs compares with that of ophthalmologist-written responses is unknown. This study evaluated the readability of GPT-4-generated and ophthalmologist-written responses to patient questions about ophthalmic surgery.
This retrospective cross-sectional study used 200 randomly selected patient questions about ophthalmic surgery extracted from the American Academy of Ophthalmology's EyeSmart platform. The questions were entered into GPT-4, and the generated responses were recorded. Ophthalmologist-written replies to the same questions were compiled for comparison. Readability of the GPT-4 and ophthalmologist responses was assessed using six validated metrics: Flesch-Kincaid Reading Ease (FK-RE), Flesch-Kincaid Grade Level (FK-GL), Gunning Fog Score (GFS), SMOG Index (SI), Coleman-Liau Index (CLI), and Automated Readability Index (ARI). Descriptive statistics, one-way ANOVA, the Shapiro-Wilk test, and Levene's test (α = 0.05) were used to compare readability between the two groups.
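For readers who want to run a similar analysis, a minimal sketch is given below using Python's textstat and scipy packages to compute the six metrics and apply the tests named above. This is an illustration only, not the authors' actual pipeline; the sample responses are invented placeholders standing in for the study's 200 paired responses.

```python
# Minimal, illustrative sketch (not the authors' pipeline): score two groups of
# responses on six readability metrics and compare them with the tests named above.
import textstat
from scipy import stats

METRICS = {
    "FK-RE": textstat.flesch_reading_ease,
    "FK-GL": textstat.flesch_kincaid_grade,
    "GFS":   textstat.gunning_fog,
    "SI":    textstat.smog_index,
    "CLI":   textstat.coleman_liau_index,
    "ARI":   textstat.automated_readability_index,
}

# Placeholder texts; in the study these would be the 200 paired responses.
gpt4_responses = [
    "Phacoemulsification fragments the crystalline lens with ultrasonic energy. "
    "The fragments are aspirated through a small corneal incision. "
    "A foldable intraocular lens is then implanted in the capsular bag.",
    "Postoperative endophthalmitis is a rare complication. "
    "It typically presents with pain and decreased visual acuity. "
    "Prompt intravitreal antibiotic therapy is usually indicated.",
    "Refractive outcomes depend on accurate preoperative biometry. "
    "Axial length and keratometry determine the lens power. "
    "Residual refractive error can be corrected with spectacles.",
]
ophthalmologist_responses = [
    "We break up the cloudy lens with sound waves. "
    "Then we remove the pieces through a tiny cut. "
    "A new clear lens is placed in your eye.",
    "Serious infections after surgery are very rare. "
    "Call us right away if you have pain or blurry vision. "
    "Early treatment works best.",
    "We measure your eye before surgery. "
    "Those measurements help us pick the right lens for you. "
    "You may still need glasses for reading.",
]

for name, fn in METRICS.items():
    gpt4_scores = [fn(t) for t in gpt4_responses]
    md_scores = [fn(t) for t in ophthalmologist_responses]
    # Normality and homogeneity-of-variance checks, then the one-way ANOVA.
    _, p_norm = stats.shapiro(gpt4_scores + md_scores)
    _, p_var = stats.levene(gpt4_scores, md_scores)
    f_stat, p_anova = stats.f_oneway(gpt4_scores, md_scores)
    print(f"{name}: F={f_stat:.2f}, p={p_anova:.3f} "
          f"(Shapiro p={p_norm:.3f}, Levene p={p_var:.3f})")
```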
GPT-4 used a higher percentage of complex words (24.42%) than ophthalmologists (17.76%), although the mean (standard deviation) word count per sentence was similar (18.43 [2.95] vs. 18.01 [6.09]). Across all six metrics (FK-RE, FK-GL, GFS, SI, CLI, and ARI), GPT-4 responses were harder to read (34.39 [8.51]; 13.19 [2.63]; 16.37 [2.04]; 12.18 [1.43]; 15.72 [1.40]; 12.99 [1.86]) than ophthalmologists' responses (50.61 [15.53]; 10.71 [2.99]; 14.13 [3.55]; 10.07 [2.46]; 12.64 [2.93]; 10.40 [3.61]); note that a lower FK-RE score indicates more difficult text, whereas the remaining metrics report approximate grade levels. Responses from both sources required a 12th-grade education for comprehension. ANOVA showed significant differences (p < 0.05) for all comparisons except word count per sentence (p = 0.438).
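For context, the direction of these scores follows from the standard Flesch-Kincaid formulas (reproduced below from their published definitions, not derived from this study): the Reading Ease score falls as sentences lengthen and words gain syllables, while the Grade Level score rises.

\[
\text{FK-RE} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}
\]
\[
\text{FK-GL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
\]

Because mean sentence length was similar in the two groups (about 18 words), the readability gap reported above is driven largely by the higher proportion of complex, multi-syllable words in the GPT-4 responses.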
The National Institutes of Health recommends that health information be written at a 6th- to 7th-grade reading level. Both GPT-4-generated and ophthalmologist-written answers exceeded this recommendation, with GPT-4 exceeding it by a wider margin. Information accessibility is vital when designing patient resources, particularly with the rise of AI as an educational tool.