Will John, Gupta Mahin, Zaretsky Jonah, Dowlath Aliesha, Testa Paul, Feldman Jonah
Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States.
Division of Hospital Medicine, Department of Medicine, New York University Langone Health, New York, NY, United States.
J Med Internet Res. 2025 Jun 4;27:e69955. doi: 10.2196/69955.
Online accessible patient education materials (PEMs) are essential for patient empowerment. However, studies have shown that these materials often exceed the recommended sixth-grade reading level, making them difficult for many patients to understand. Large language models (LLMs) have the potential to simplify PEMs into more readable educational content.
We sought to evaluate whether 3 LLMs (ChatGPT [OpenAI], Gemini [Google], and Claude [Anthropic PBC]) can optimize the readability of PEMs to the recommended reading level without compromising accuracy.
This cross-sectional study used 60 randomly selected PEMs available online from 3 websites. We prompted the LLMs to simplify the reading level of these online PEMs. The primary outcome was the readability of the original online PEMs compared with the LLM-simplified versions. Readability scores were calculated using 4 validated indices: Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, and Simple Measure of Gobbledygook (SMOG) Index. Accuracy and understandability were also assessed as balancing measures, with understandability measured using the Patient Education Materials Assessment Tool-Understandability (PEMAT-U).
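For reference, the 4 readability indices named above can be computed with the open-source textstat package; the sketch below is an illustration only, as the abstract does not specify the tooling the authors used, and the sample passage is a hypothetical PEM excerpt.

```python
# Sketch: scoring a PEM passage with the 4 readability indices named above.
# Uses the open-source `textstat` package (pip install textstat); this is an
# illustration, not the study's actual analysis pipeline.
import textstat

# Hypothetical PEM excerpt (not taken from the study materials)
pem_text = (
    "High blood pressure means the force of blood against your artery walls "
    "is too high. Over time, this can damage your heart and blood vessels."
)

scores = {
    "Flesch Reading Ease": textstat.flesch_reading_ease(pem_text),
    "Flesch-Kincaid Grade Level": textstat.flesch_kincaid_grade(pem_text),
    "Gunning Fog Index": textstat.gunning_fog(pem_text),
    "SMOG Index": textstat.smog_index(pem_text),
}

for index_name, value in scores.items():
    print(f"{index_name}: {value:.1f}")
```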
The original readability scores for the American Heart Association (AHA), American Cancer Society (ACS), and American Stroke Association (ASA) websites were above the recommended sixth-grade level, with mean grade-level scores of 10.7, 10.0, and 9.6, respectively. After optimization by the LLMs, readability scores improved significantly across all 3 websites compared with the original text. By Wilcoxon signed rank test, ChatGPT improved the mean grade level from 10.1 to 7.6 (P<.001); Gemini, to 6.6 (P<.001); and Claude, to 5.6 (P<.001). All LLMs significantly reduced word counts, from a mean range of 410.9-953.9 words to a mean range of 201.9-248.1 words. None of the ChatGPT-simplified PEMs were inaccurate, while 3.3% of the Gemini- and Claude-simplified PEMs were inaccurate. Baseline understandability scores, as measured by PEMAT-U, were preserved across all LLM-simplified versions.
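The paired comparison reported above can be outlined with a Wilcoxon signed rank test as implemented in SciPy; the sketch below uses placeholder grade-level scores, since the per-document data are not given in the abstract.

```python
# Sketch: paired Wilcoxon signed rank test comparing original vs. LLM-simplified
# grade-level scores. The numbers below are placeholders, not study data.
from scipy.stats import wilcoxon

original_grade_levels = [10.2, 9.8, 11.0, 10.5, 9.6, 10.9, 10.1, 9.9]
simplified_grade_levels = [7.5, 7.8, 7.2, 7.9, 7.4, 7.7, 7.6, 7.8]

statistic, p_value = wilcoxon(original_grade_levels, simplified_grade_levels)
print(f"Wilcoxon statistic = {statistic}, P = {p_value:.4f}")
```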
This cross-sectional study demonstrates that LLMs have the potential to significantly enhance the readability of online PEMs while maintaining accuracy and understandability, making them more accessible to a broader audience. However, variability in model performance and the inaccuracies observed underscore the need for human review of LLM output. Further study is needed to explore advanced LLM techniques and models trained specifically for medical content.