
Comparing Artificial Intelligence-Generated and Clinician-Created Personalized Self-Management Guidance for Patients With Knee Osteoarthritis: Blinded Observational Study.

Author information

Du Kai, Li Ao, Zuo Qi-Heng, Zhang Chen-Yu, Guo Ren, Chen Ping, Du Wei-Shuai, Li Shu-Ming

Affiliations

Beijing University of Chinese Medicine, Beijing, China.

Beijing Hospital of Traditional Chinese Medicine, Beijing, China.

Publication information

J Med Internet Res. 2025 May 7;27:e67830. doi: 10.2196/67830.

Abstract

BACKGROUND

Knee osteoarthritis is a prevalent, chronic musculoskeletal disorder that impairs mobility and quality of life. Personalized patient education aims to improve self-management and adherence, yet its delivery is often limited by time constraints, clinician workload, and the heterogeneity of patient needs. Recent advances in large language models offer potential solutions. GPT-4 (OpenAI), distinguished by its long-context reasoning and adoption in clinical artificial intelligence research, emerged as a leading candidate for personalized health communication. However, its application in generating condition-specific educational guidance remains underexplored, and concerns about misinformation, personalization limits, and ethical oversight remain.

OBJECTIVE

We evaluated GPT-4's ability to generate individualized self-management guidance for patients with knee osteoarthritis in comparison with clinician-created content.

METHODS

This 2-phase, double-blind, observational study used data from 50 patients previously enrolled in a registered randomized trial. In phase 1, 2 orthopedic clinicians each generated personalized education materials for 25 patient profiles using anonymized clinical data, including history, symptoms, and lifestyle. In phase 2, the same datasets were processed by GPT-4 using standardized prompts. All content was anonymized and evaluated by 2 independent, blinded clinical experts using validated scoring systems. Evaluation criteria included efficiency, readability (Flesch-Kincaid, Gunning Fog, Coleman-Liau, and Simple Measure of Gobbledygook), accuracy, personalization, comprehensiveness, and safety. Disagreements between reviewers were resolved through consensus or third-party adjudication.
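The four readability indices named above are standard formulas over sentence, word, syllable, and letter counts. A minimal sketch of how such grade-level scores can be computed is shown below; the syllable counter is a naive vowel-group heuristic, so values will deviate somewhat from the dictionary-based tools presumably used in the study, and the "complex word" count for Gunning Fog and SMOG ignores the formula's usual exclusions (proper nouns, common suffixes):

```python
import math
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count vowel groups, drop a trailing silent "e".
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> dict:
    """Approximate Flesch-Kincaid, Gunning Fog, Coleman-Liau, and SMOG
    grade levels from raw counts."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z]+", text)
    w = max(len(words), 1)
    syllables = sum(count_syllables(x) for x in words)
    letters = sum(len(x) for x in words)
    complex_words = sum(1 for x in words if count_syllables(x) >= 3)
    return {
        # 0.39 * words/sentence + 11.8 * syllables/word - 15.59
        "flesch_kincaid": 0.39 * w / sentences + 11.8 * syllables / w - 15.59,
        # 0.4 * (words/sentence + 100 * complex/words)
        "gunning_fog": 0.4 * (w / sentences + 100 * complex_words / w),
        # 0.0588 * letters-per-100-words - 0.296 * sentences-per-100-words - 15.8
        "coleman_liau": 0.0588 * (100 * letters / w)
                        - 0.296 * (100 * sentences / w) - 15.8,
        # 1.0430 * sqrt(polysyllables * 30/sentences) + 3.1291
        "smog": 1.0430 * math.sqrt(complex_words * 30 / sentences) + 3.1291,
    }
```

Higher scores indicate harder text; the study's mean scores of roughly 11-16 correspond to late-high-school to college reading levels on all four scales.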

RESULTS

GPT-4 outperformed clinicians in content generation speed (530.03 vs 37.29 words per minute; P<.001). Readability was better on the Flesch-Kincaid (mean 11.56, SD 1.08 vs mean 12.67, SD 0.95), Gunning Fog (mean 12.47, SD 1.36 vs mean 14.56, SD 0.93), and Simple Measure of Gobbledygook (mean 13.33, SD 1.00 vs mean 13.81, SD 0.69) indices (all P<.001), though GPT-4 scored slightly higher on the Coleman-Liau Index (mean 15.90, SD 1.03 vs mean 15.15, SD 0.91). GPT-4 also outperformed clinicians in accuracy (mean 5.31, SD 1.73 vs mean 4.76, SD 1.10; P=.05), personalization (mean 54.32, SD 6.21 vs mean 33.20, SD 5.40; P<.001), comprehensiveness (mean 51.74, SD 6.47 vs mean 35.26, SD 6.66; P<.001), and safety (median 61, IQR 58-66 vs median 50, IQR 47-55.25; P<.001).
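The borderline accuracy result (P=.05) can be roughly reconstructed from the reported summary statistics. The sketch below applies a Welch two-sample t test with a normal approximation for the p value, assuming 50 scores per arm; the study compared paired content for the same patient profiles, so its actual (likely paired) test will give a somewhat different p value:

```python
from math import sqrt
from statistics import NormalDist

def welch_t_from_stats(m1, s1, n1, m2, s2, n2):
    """Welch t statistic from group means and SDs; two-sided p value via
    a normal approximation (reasonable here, since df is about 83)."""
    se = sqrt(s1**2 / n1 + s2**2 / n2)
    t = (m1 - m2) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p

# Reported accuracy scores: GPT-4 mean 5.31 (SD 1.73) vs clinicians
# mean 4.76 (SD 1.10). The sample size of 50 per arm is an assumption
# taken from the 50 patient profiles in the study design.
t, p = welch_t_from_stats(5.31, 1.73, 50, 4.76, 1.10, 50)
```

This unpaired approximation lands near the reported P=.05, consistent with the authors' characterization of accuracy as the closest comparison.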

CONCLUSIONS

GPT-4 could generate personalized self-management guidance for knee osteoarthritis with greater efficiency, accuracy, personalization, comprehensiveness, and safety than clinician-generated content, as assessed using standardized, guideline-aligned evaluation frameworks. These findings underscore the potential of large language models to support scalable, high-quality patient education in chronic disease management. The observed lexical complexity suggests the need to refine outputs for populations with limited health literacy. As an exploratory, single-center study, these results warrant confirmation in larger, multicenter cohorts with diverse demographic profiles. Future implementation should be guided by ethical and operational safeguards, including data privacy, transparency, and the delineation of clinical responsibility. Hybrid models integrating artificial intelligence-generated content with clinician oversight may offer a pragmatic path forward.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ec6/12096024/7b53f87a52e6/jmir_v27i1e67830_fig1.jpg
