

Evaluating AI-generated patient education materials for spinal surgeries: Comparative analysis of readability and DISCERN quality across ChatGPT and DeepSeek models.

Author Information

Zhou Mi, Pan Yun, Zhang Yuye, Song Xiaomei, Zhou Youbin

Affiliations

Allied Health & Human Performance, University of South Australia, Adelaide, Australia.

Department of Cardiovascular Medicine, The Second Affiliated Hospital of Soochow University, Suzhou, Jiangsu, China.

Publication Information

Int J Med Inform. 2025 Jun;198:105871. doi: 10.1016/j.ijmedinf.2025.105871. Epub 2025 Mar 13.

Abstract

BACKGROUND

Access to patient-centered health information is essential for informed decision-making. However, online medical resources vary in quality and often fail to accommodate differing degrees of health literacy. This issue is particularly evident in surgical contexts, where complex terminology obstructs patient comprehension. With the increasing reliance on AI models for supplementary medical information, the reliability and readability of AI-generated content require thorough evaluation.

OBJECTIVE

This study aimed to evaluate four natural language processing models (ChatGPT-4o, ChatGPT-o3 mini, DeepSeek-V3, and DeepSeek-R1) in generating patient education materials for three common spinal surgeries: lumbar discectomy, spinal fusion, and decompressive laminectomy. Information quality was evaluated using the DISCERN score, and readability was assessed through Flesch-Kincaid indices.
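
For context, "Flesch-Kincaid indices" refers to two standard readability measures; their published formulas are reproduced below as general definitions, not as values taken from this study:

```latex
% Flesch-Kincaid Grade Level (lower = easier; approximates a US school grade):
\mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right)
             + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59

% Flesch Reading Ease (higher = easier; roughly a 0--100 scale):
\mathrm{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right)
             - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
```

Patient-education guidance commonly recommends roughly a sixth-to-eighth-grade reading level, which is why lower FKGL values are treated as favorable in the results that follow.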

RESULTS

DeepSeek-R1 produced the most readable responses, with Flesch-Kincaid Grade Level (FKGL) scores ranging from 7.2 to 9.0, followed by ChatGPT-4o. In contrast, ChatGPT-o3 mini exhibited the lowest readability (FKGL > 10.4). The DISCERN scores for all AI models were below 60, classifying the information quality as "fair," primarily due to insufficient cited references.
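
To make these grade-level figures concrete, below is a minimal sketch of an FKGL computation. It assumes a naive vowel-group syllable heuristic (published readability tools use more careful syllable counting), and the sample sentences are invented for illustration:

```python
import re

def count_syllables(word: str) -> int:
    """Naive syllable estimate: count vowel groups, drop one for a silent
    trailing 'e'. A rough heuristic, not a dictionary-based counter."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Invented sample of patient-education prose, for illustration only.
sample = ("Lumbar discectomy removes part of a damaged disc in the lower back. "
          "Most patients go home the same day and can walk within hours.")
print(f"FKGL ~ {fkgl(sample):.1f}")
```

Read against the formula, the 7.2-9.0 range reported for DeepSeek-R1 roughly corresponds to text a US 7th-to-9th-grader could follow, whereas FKGL above 10.4 approaches late-high-school reading material.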

CONCLUSION

All models achieved merely a "fair" quality rating, underscoring the need for improvements in citation practices and personalization. Nonetheless, DeepSeek-R1 and ChatGPT-4o generated more readable surgical information than ChatGPT-o3 mini. Given that enhanced readability can improve patient engagement, reduce anxiety, and contribute to better surgical outcomes, these two models should be prioritized for assisting patients in clinical settings.

LIMITATIONS & FUTURE DIRECTIONS

This study is limited by the rapid evolution of AI models, its exclusive focus on spinal surgery education, and the absence of real-world patient feedback, all of which may affect the generalizability and long-term applicability of the findings. Future research should explore interactive, multimodal approaches and incorporate patient feedback to ensure that AI-generated health information is accurate, accessible, and supportive of informed healthcare decisions.

