Daram Naveena R, Maxwell Rose A, D'Amato Josette, Massengill Jason C
Wright State University, Department of Obstetrics & Gynecology, Dayton, OH.
Am J Obstet Gynecol. 2025 Jun 25. doi: 10.1016/j.ajog.2025.06.047.
BACKGROUND: The American Medical Association recommends that patient information be written at a sixth-grade reading level to increase accessibility. However, most existing patient education materials exceed this threshold, posing challenges to patient comprehension. Artificial intelligence, particularly large language models, presents an opportunity to improve the readability of medical information. Despite the growing integration of artificial intelligence in healthcare, few studies have evaluated the effectiveness of large language models in generating patient education materials in gynecology or improving the readability of existing ones.

OBJECTIVE: To assess the readability and effectiveness of patient education materials generated by ChatGPT, Gemini, and CoPilot compared with those from the American College of Obstetricians and Gynecologists and UpToDate.com, and to determine whether these large language models can successfully adjust the reading level to a sixth-grade standard.

STUDY DESIGN: This cross-sectional study analyzed American College of Obstetricians and Gynecologists, UpToDate, and large language model-generated content, evaluating the large language models on 2 tasks: 1) generating patient education materials independently and 2) rewriting existing patient information to a sixth-grade reading level. All materials underwent basic textual analysis and readability assessment using 8 readability formulas. Two board-certified obstetrician-gynecologists evaluated the blinded patient education materials for accuracy, clarity, and comprehension. Analysis of variance was used to compare textual analysis and readability scores, with Tukey post-hoc tests identifying differences for both original and enhanced materials. An alpha threshold of P<.004 was used to account for multiple comparisons.
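The abstract does not name the 8 readability formulas used; the Flesch-Kincaid Grade Level is one widely used example of how such a grade estimate is computed from word, sentence, and syllable counts. A minimal sketch, assuming a simple heuristic syllable counter (production tools use dictionary-backed counts):

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable count: contiguous vowel groups, with a
    silent-e adjustment. A heuristic, not a dictionary lookup."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1  # treat trailing silent 'e' as non-syllabic
    return max(n, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    A result of ~6.0 corresponds to the sixth-grade target."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)


# Hypothetical patient-facing sentence, not from the study materials.
sample = ("Talk to your doctor before you take any new medicine. "
          "Some medicines can change how other medicines work.")
print(round(flesch_kincaid_grade(sample), 1))
```

Shorter sentences lower the first term and shorter words lower the second, which is why the study's finding (shorter documents but a higher share of polysyllabic "difficult words") can still raise the computed grade level.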
RESULTS: Large language model-generated materials were significantly shorter (mean word count, 407.9 vs 1132.0; P<.001) but contained a higher proportion of difficult words (36.7% vs 27.4%; P<.001). American College of Obstetricians and Gynecologists and UpToDate materials averaged ninth-grade and 8.6-grade reading levels, respectively, while artificial intelligence-generated content reached a 10.6-grade level (P=.008). Although CoPilot and Gemini improved readability when prompted, no large language model reached the sixth-grade benchmark, and ChatGPT increased reading difficulty.

CONCLUSION: Large language models generated more concise patient education materials but often introduced more complex vocabulary, ultimately failing to meet recommended health literacy standards. Even when explicitly prompted, no large language model achieved the sixth-grade reading level required for optimal patient comprehension. Without proper oversight, artificial intelligence-generated patient education materials may create the illusion of simplicity while reducing true accessibility. Future efforts should focus on integrating health literacy safeguards into artificial intelligence models before clinical implementation.