Pritzker School of Medicine, University of Chicago, Chicago, IL, United States.
Section of Dermatology, University of Chicago Medical Center, Chicago, IL, United States.
JMIR Dermatol. 2024 May 16;7:e55898. doi: 10.2196/55898.
Dermatologic patient education materials (PEMs) are often written above the national average seventh- to eighth-grade reading level. ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT are large language models (LLMs) that are responsive to user prompts. Our project assesses their use in generating dermatologic PEMs at specified reading levels.
This study aims to assess the ability of select LLMs to generate PEMs for common and rare dermatologic conditions at unspecified and specified reading levels. It further aims to determine whether meaning is preserved in these LLM-generated PEMs, as judged by dermatology resident trainees.
The Flesch-Kincaid reading level (FKRL) of current American Academy of Dermatology PEMs was evaluated for 4 common (atopic dermatitis, acne vulgaris, psoriasis, and herpes zoster) and 4 rare (epidermolysis bullosa, bullous pemphigoid, lamellar ichthyosis, and lichen planus) dermatologic conditions. We prompted ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT with "Create a patient education handout about [condition] at a [FKRL]" to iteratively generate 10 PEMs per condition at unspecified, fifth-grade, and seventh-grade FKRLs, and evaluated the outputs with Microsoft Word readability statistics. The preservation of meaning across LLMs was assessed by 2 dermatology resident trainees.
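For readers who want to approximate the readability scoring outside Microsoft Word, the Flesch-Kincaid grade level is defined as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below is a minimal, illustrative Python implementation assuming a naive regex-based sentence splitter and vowel-group syllable counter; it is not the tool used in this study, and its counts may differ slightly from Word's readability statistics.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels; assume at least one syllable per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Split into sentences on ., !, ? and extract alphabetic words.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula.
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

if __name__ == "__main__":
    sample = "Eczema makes skin dry and itchy. Gentle creams help the skin heal."
    print(round(flesch_kincaid_grade(sample), 2))
```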
The current American Academy of Dermatology PEMs had an average (SD) FKRL of 9.35 (1.26) and 9.50 (2.3) for common and rare diseases, respectively. For common diseases, the FKRLs of LLM-produced PEMs ranged between 9.8 and 11.21 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). For rare diseases, the FKRLs of LLM-produced PEMs ranged between 9.85 and 11.45 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). At the fifth-grade reading level, GPT-4 was better at producing PEMs for both common and rare conditions than ChatGPT-3.5 (P=.001 and P=.01, respectively), DermGPT (P<.001 and P=.03, respectively), and DocsGPT (P<.001 and P=.02, respectively). At the seventh-grade reading level, no significant difference was found between ChatGPT-3.5, GPT-4, DocsGPT, or DermGPT in producing PEMs for common conditions (all P>.05); however, for rare conditions, ChatGPT-3.5 and DocsGPT outperformed GPT-4 (P=.003 and P<.001, respectively). The preservation of meaning analysis revealed that for common conditions, DermGPT ranked the highest for overall ease of reading, patient understandability, and accuracy (14.75/15, 98%); for rare conditions, handouts generated by GPT-4 ranked the highest (14.5/15, 97%).
GPT-4 appeared to outperform ChatGPT-3.5, DocsGPT, and DermGPT at the fifth-grade FKRL for both common and rare conditions, although both ChatGPT-3.5 and DocsGPT performed better than GPT-4 at the seventh-grade FKRL for rare conditions. LLM-produced PEMs may reliably meet seventh-grade FKRLs for select common and rare dermatologic conditions and are easy to read, understandable for patients, and mostly accurate. LLMs may play a role in enhancing health literacy and disseminating accessible, understandable PEMs in dermatology.