

Comparative evaluation of large language models in delivering guideline-compliant recommendations for topical NSAID use in musculoskeletal pain: a multidimensional analysis.

Author Information

Dong Chengqi, Qiu Xu, Deng Jiayi, Xu Li, Dong Xiaoxue, Chen Shi, Mei Tao, Li Qinghua, Cheng Yuan, Sun Jianliang, Wang Hanbin, Yu Liang

Affiliations

Department of Pain, Affiliated Hangzhou First People's Hospital, Westlake University School of Medicine, Hangzhou, China.

The Fourth School of Clinical Medicine, Zhejiang Chinese Medical University; Hangzhou First People's Hospital, Hangzhou, China.

Publication Information

Clin Rheumatol. 2025 Sep 15. doi: 10.1007/s10067-025-07640-4.

Abstract

INTRODUCTION

While large language models (LLMs) are increasingly used in clinical decision support, their adherence to evidence-based guidelines, particularly for musculoskeletal pain management, remains understudied.

METHODS

Four LLMs (DeepSeek-R1, ChatGPT-4o, Gemini, Grok-3) were evaluated on their responses to questions about topical NSAID use for musculoskeletal pain across three dimensions: response quality (accuracy, over-conclusiveness, supplementary information, and incompleteness), standardized readability metrics (Flesch Reading Ease, Flesch-Kincaid Grade Level), and actionability as quantified by the PEMAT-P tool.
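The two readability metrics named above are standard closed-form scores over words, sentences, and syllables. As an illustrative sketch only (not the authors' evaluation code), a minimal Python implementation using a rough vowel-group syllable heuristic might look like:

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1 and not word.endswith(("le", "ee")):
        n -= 1
    return max(n, 1)

def flesch_metrics(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl
```

Higher Flesch Reading Ease means easier text; a Flesch-Kincaid Grade Level of 9-10 corresponds to the readability threshold the study reports the models exceeding. Published analyses typically use validated syllable counters rather than this heuristic.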

RESULTS

The four LLMs showed significant variability in accuracy (ANOVA p = 0.045), with Gemini scoring highest (8.33 ± 0.77) and DeepSeek-R1 lowest (7.72 ± 1.52), and in over-conclusiveness (ANOVA p = 0.025), with Grok-3 scoring lowest (4.56 ± 1.42) and ChatGPT-4o highest (6.72 ± 1.49). ChatGPT-4o provided the most supplementary content (6.94 ± 2.29, p = 0.106), and DeepSeek-R1 had the highest incompleteness score (5.00 ± 2.52, p = 0.261). All models exceeded the recommended readability threshold (9th-10th grade level), and none met the actionability standard (≤ 33.5%).
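The between-model comparisons reported here use one-way ANOVA. As a hedged sketch (not the study's actual analysis pipeline), the F statistic underlying those p-values can be computed from between-group and within-group sums of squares:

```python
def one_way_anova_f(groups: list[list[float]]) -> float:
    """One-way ANOVA F statistic for k independent samples (pure Python)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares: group sizes times squared mean deviations
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviations from each group's mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    # F = mean square between / mean square within
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

In practice a library routine such as `scipy.stats.f_oneway` would be used to obtain both F and the p-value; this sketch only shows the computation the test rests on.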

CONCLUSIONS

LLMs demonstrate potential as clinical aids. Gemini and Grok performed relatively well overall, yet their readability and actionability remain unsatisfactory. Future development should integrate clinician feedback and real-world validation to ensure safety. Human oversight and targeted AI training are critical for safe implementation.

Key Points

• The study reveals significant differences in accuracy among LLMs, highlighting inconsistencies in clinical decision support.

• While all models generated readable text, the complexity remained high, potentially limiting accessibility for some patients.

• Over-conclusiveness and incomplete adherence to evidence-based guidelines underscore the necessity for human oversight and targeted AI training in clinical applications.

