Rosalind Franklin University of Medicine and Science Chicago Medical School, North Chicago, Illinois, USA.
Department of Ophthalmology, University of Arkansas for Medical Sciences, Little Rock, AR, USA.
Br J Ophthalmol. 2024 Sep 20;108(10):1470-1476. doi: 10.1136/bjo-2024-325252.
BACKGROUND/AIMS: This was a cross-sectional comparative study. We evaluated the ability of three large language models (LLMs; ChatGPT-3.5, ChatGPT-4 and Google Bard) to generate novel patient education materials (PEMs) and to improve the readability of existing PEMs on paediatric cataract.
METHODS: We compared the LLMs' responses to three prompts. Prompt A asked each LLM to write a handout on paediatric cataract that was 'easily understandable by an average American'. Prompt B modified prompt A to request that the handout be written at a 'sixth-grade reading level, using the Simple Measure of Gobbledygook (SMOG) readability formula'. Prompt C asked each LLM to rewrite existing PEMs on paediatric cataract 'to a sixth-grade reading level using the SMOG readability formula'. Responses were compared on quality (DISCERN; 1 (low quality) to 5 (high quality)), understandability and actionability (Patient Education Materials Assessment Tool; ≥70%: understandable, ≥70%: actionable), accuracy (Likert misinformation scale; 1 (no misinformation) to 5 (high misinformation)) and readability (SMOG and Flesch-Kincaid Grade Level (FKGL); grade level <7: highly readable).
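For readers unfamiliar with the two readability formulas named above, the sketch below shows how SMOG and FKGL grade levels are conventionally computed from sentence, word and syllable counts. It is illustrative only and is not the scoring pipeline used in the study; the function names and the naive vowel-group syllable heuristic are assumptions, and published SMOG scoring normally samples 30 sentences, which the 30/sentence-count scaling approximates here.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_scores(text: str) -> dict:
    """Return approximate SMOG and Flesch-Kincaid Grade Level scores for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    polysyllables = sum(1 for s in syllables if s >= 3)

    n_sent, n_words, n_syll = len(sentences), len(words), sum(syllables)

    # SMOG grade = 1.0430 * sqrt(polysyllable count * 30 / sentence count) + 3.1291
    smog = 1.0430 * math.sqrt(polysyllables * (30 / n_sent)) + 3.1291
    # FKGL = 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
    fkgl = 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59

    return {"SMOG": round(smog, 2), "FKGL": round(fkgl, 2)}


if __name__ == "__main__":
    sample = ("A cataract is a cloudy area in the lens of the eye. "
              "In children, it can blur vision and needs early care.")
    print(readability_scores(sample))  # lower grade levels indicate easier reading
```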
RESULTS: All LLM-generated responses were of high quality (median DISCERN ≥4), understandable (≥70%) and accurate (Likert=1), but none was actionable (<70%). ChatGPT-3.5 and ChatGPT-4 prompt B responses were more readable than their prompt A responses (p<0.001). ChatGPT-4 generated more readable responses (lower SMOG and FKGL scores; 5.59±0.5 and 4.31±0.7, respectively) than the other two LLMs (p<0.001) and consistently rewrote existing PEMs to or below the specified sixth-grade reading level (SMOG: 5.14±0.3).
CONCLUSION: LLMs, particularly ChatGPT-4, proved valuable in generating high-quality, readable, accurate PEMs and in improving the readability of existing materials on paediatric cataract.