
Evaluating AI-generated patient education materials for spinal surgeries: Comparative analysis of readability and DISCERN quality across ChatGPT and deepseek models.

Authors

Zhou Mi, Pan Yun, Zhang Yuye, Song Xiaomei, Zhou Youbin

Affiliations

Allied Health & Human Performance, University of South Australia, Adelaide, Australia.

Department of Cardiovascular Medicine, The Second Affiliated Hospital of Soochow University, Suzhou, Jiangsu, China.

Publication

Int J Med Inform. 2025 Jun;198:105871. doi: 10.1016/j.ijmedinf.2025.105871. Epub 2025 Mar 13.


DOI: 10.1016/j.ijmedinf.2025.105871
PMID: 40107040
Abstract

BACKGROUND: Access to patient-centered health information is essential for informed decision-making. However, online medical resources vary in quality and often fail to accommodate differing levels of health literacy. This issue is particularly evident in surgical contexts, where complex terminology obstructs patient comprehension. With increasing reliance on AI models for supplementary medical information, the reliability and readability of AI-generated content require thorough evaluation.

OBJECTIVE: This study aimed to evaluate four natural language processing models (ChatGPT-4o, ChatGPT-o3 mini, DeepSeek-V3, and DeepSeek-R1) in generating patient education materials for three common spinal surgeries: lumbar discectomy, spinal fusion, and decompressive laminectomy. Information quality was evaluated using the DISCERN instrument, and readability was assessed with Flesch-Kincaid indices.

RESULTS: DeepSeek-R1 produced the most readable responses, with Flesch-Kincaid Grade Level (FKGL) scores ranging from 7.2 to 9.0, followed by ChatGPT-4o. In contrast, ChatGPT-o3 mini exhibited the lowest readability (FKGL > 10.4). DISCERN scores for all models were below 60, classifying the information quality as "fair," primarily due to insufficient cited references.

CONCLUSION: All models achieved only a "fair" quality rating, underscoring the need for improved citation practices and personalization. Nonetheless, DeepSeek-R1 and ChatGPT-4o generated more readable surgical information than ChatGPT-o3 mini. Given that enhanced readability can improve patient engagement, reduce anxiety, and contribute to better surgical outcomes, these two models should be prioritized for assisting patients in clinical settings.

LIMITATIONS & FUTURE DIRECTIONS: This study is limited by the rapid evolution of AI models, its exclusive focus on spinal surgery education, and the absence of real-world patient feedback, which may affect the generalizability and long-term applicability of the findings. Future research should explore interactive, multimodal approaches and incorporate patient feedback to ensure that AI-generated health information is accurate, accessible, and supports informed healthcare decisions.
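The Flesch-Kincaid Grade Level reported above follows a fixed published formula: 0.39 × (words/sentence) + 11.8 × (syllables/word) − 15.59. A minimal sketch of that computation is shown below; the vowel-group syllable counter is a rough heuristic of our own for illustration, not the validated tooling the authors used, and the tokenization rules are simplifying assumptions.

```python
import re


def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels, dropping a
    # trailing silent "e". Dedicated readability tools use dictionary
    # lookups and are more accurate.
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)


def fkgl(text: str) -> float:
    # Flesch-Kincaid Grade Level:
    #   0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

A score of 7.2-9.0, as DeepSeek-R1 achieved, corresponds roughly to a US 7th-9th grade reading level; FKGL > 10.4 indicates text requiring high-school-level reading ability or above.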


Similar Articles

[1]
Evaluating AI-generated patient education materials for spinal surgeries: Comparative analysis of readability and DISCERN quality across ChatGPT and deepseek models.

Int J Med Inform. 2025-6

[2]
American academy of Orthopedic Surgeons' OrthoInfo provides more readable information regarding meniscus injury than ChatGPT-4 while information accuracy is comparable.

J ISAKOS. 2025-4

[3]
Assessing the quality and readability of patient education materials on chemotherapy cardiotoxicity from artificial intelligence chatbots: An observational cross-sectional study.

Medicine (Baltimore). 2025-4-11

[4]
Improving readability in AI-generated medical information on fragility fractures: the role of prompt wording on ChatGPT's responses.

Osteoporos Int. 2025-3

[5]
Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis.

Surg Endosc. 2024-5

[6]
ChatGPT as a patient education tool in colorectal cancer-An in-depth assessment of efficacy, quality and readability.

Colorectal Dis. 2025-1

[7]
Assessing the Readability, Reliability, and Quality of AI-Modified and Generated Patient Education Materials for Endoscopic Skull Base Surgery.

Am J Rhinol Allergy. 2024-11

[8]
Generative artificial intelligence chatbots may provide appropriate informational responses to common vascular surgery questions by patients.

Vascular. 2025-2

[9]
The quality and readability of patient information provided by ChatGPT: can AI reliably explain common ENT operations?

Eur Arch Otorhinolaryngol. 2024-11

[10]
Evaluating the Efficacy of ChatGPT as a Patient Education Tool in Prostate Cancer: Multimetric Assessment.

J Med Internet Res. 2024-8-14

Cited By

[1]
Evaluating artificial intelligence chatbots' responses to gynecomastia inquiries: Comparative study of information quality, readability, and guideline consistency.

Digit Health. 2025-8-26

[2]
ChatGPT-4.0 or DeepSeek-V3? Comparative analysis of answers to the most frequently asked questions by total knee replacement candidate patients.

Medicine (Baltimore). 2025-8-22

[3]
Exploring the use of large language models for classification, clinical interpretation, and treatment recommendation in breast tumor patient records.

Sci Rep. 2025-8-26

[4]
Histological Image Classification Between Follicular Lymphoma and Reactive Lymphoid Tissue Using Deep Learning and Explainable Artificial Intelligence (XAI).

Cancers (Basel). 2025-7-22

[5]
Battle of the artificial intelligence: a comprehensive comparative analysis of DeepSeek and ChatGPT for urinary incontinence-related questions.

Front Public Health. 2025-7-23

[6]
DeepSeek-R1 outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in bilingual complex ophthalmology reasoning.

Adv Ophthalmol Pract Res. 2025-5-9

[7]
Evaluating AI-Generated Patient Education Guides: A Comparative Study of ChatGPT and Deepseek.

Cureus. 2025-6-3
