Currently Available Large Language Models Do Not Provide Musculoskeletal Treatment Recommendations That Are Concordant With Evidence-Based Clinical Practice Guidelines.

Authors

Nwachukwu Benedict U, Varady Nathan H, Allen Answorth A, Dines Joshua S, Altchek David W, Williams Riley J, Kunze Kyle N

Affiliations

Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.

Publication information

Arthroscopy. 2025 Feb;41(2):263-275.e6. doi: 10.1016/j.arthro.2024.07.040. Epub 2024 Aug 22.

DOI: 10.1016/j.arthro.2024.07.040
PMID: 39173690

Abstract

PURPOSE

To determine whether several leading, commercially available large language models (LLMs) provide treatment recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS).

METHODS

All CPGs concerning the management of rotator cuff tears (n = 33) and anterior cruciate ligament injuries (n = 15) were extracted from the AAOS. Treatment recommendations from Chat-Generative Pretrained Transformer version 4 (ChatGPT-4), Gemini, Mistral-7B, and Claude-3 were graded by 2 blinded physicians as being concordant, discordant, or indeterminate (i.e., neutral response without definitive recommendation) with respect to AAOS CPGs. The overall concordance between LLM and AAOS recommendations was quantified, and the comparative overall concordance of recommendations among the 4 LLMs was evaluated through the Fisher exact test.
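
As an illustration of the statistical test named above, here is a minimal sketch of a two-sided Fisher exact test for a 2×2 table, written with the Python standard library only. The study compared all 4 LLMs at once, which needs the larger r×c form of the test, so the pairwise table at the bottom (ChatGPT-4 versus Mistral-7B, concordant versus not concordant, using the counts reported in the Results) is an assumption made for demonstration and is not an analysis reported by the authors. The function name `fisher_exact_2x2` is hypothetical.

```python
from math import comb


def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    total = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Illustrative pairwise table (not the authors' analysis):
# rows = ChatGPT-4, Mistral-7B; columns = concordant, not concordant.
p_value = fisher_exact_2x2(38, 48 - 38, 28, 48 - 28)
print(f"Illustrative pairwise p-value: {p_value:.3f}")
```

A pairwise p-value from this sketch is not comparable to the overall P value in the abstract, which tests all four models together.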

RESULTS

Overall, 135 responses (70.3%) were concordant, 43 (22.4%) were indeterminate, and 14 (7.3%) were discordant. Inter-rater reliability for concordance classification was excellent (κ = 0.92). Concordance with AAOS CPGs was most frequently observed with ChatGPT-4 (n = 38, 79.2%) and least frequently observed with Mistral-7B (n = 28, 58.3%). Indeterminate recommendations were most frequently observed with Mistral-7B (n = 17, 35.4%) and least frequently observed with Claude-3 (n = 8, 16.7%). Discordant recommendations were most frequently observed with Gemini (n = 6, 12.5%) and least frequently observed with ChatGPT-4 (n = 1, 2.1%). Overall, no statistically significant difference in concordant recommendations was observed across LLMs (P = .12). Of all recommendations, only 20 (10.4%) were transparent and provided references with full bibliographic details or links to specific peer-reviewed content to support recommendations.
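
The percentages above can be rechecked from the reported counts. The short sketch below assumes 48 recommendations per model (33 rotator cuff plus 15 anterior cruciate ligament items) and therefore 192 responses across the 4 LLMs; these denominators are inferred from the Methods rather than stated outright. On that assumption, for example, 8 of 48 works out to 16.7%. The helper name `pct` is hypothetical.

```python
# Denominators are assumptions inferred from the abstract: 33 rotator cuff
# plus 15 ACL guideline items per model, across 4 models.
PER_MODEL = 33 + 15      # 48 recommendations per model
TOTAL = PER_MODEL * 4    # 192 responses overall


def pct(count: int, denom: int) -> float:
    """Return count/denom as a percentage rounded to one decimal place."""
    return round(100 * count / denom, 1)


overall = {"concordant": 135, "indeterminate": 43, "discordant": 14}
transparent = 20
per_model = {
    "ChatGPT-4 concordant": 38,
    "Mistral-7B concordant": 28,
    "Mistral-7B indeterminate": 17,
    "Claude-3 indeterminate": 8,
    "Gemini discordant": 6,
    "ChatGPT-4 discordant": 1,
}

# The three grading categories should account for every response.
assert sum(overall.values()) == TOTAL

for label, n in overall.items():
    print(f"{label}: {n}/{TOTAL} = {pct(n, TOTAL)}%")
print(f"transparent: {transparent}/{TOTAL} = {pct(transparent, TOTAL)}%")
for label, n in per_model.items():
    print(f"{label}: {n}/{PER_MODEL} = {pct(n, PER_MODEL)}%")
```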

CONCLUSIONS

Among leading commercially available LLMs, more than 1-in-4 recommendations concerning the evaluation and management of rotator cuff and anterior cruciate ligament injuries do not reflect current evidence-based CPGs. Although ChatGPT-4 showed the highest performance, clinically significant rates of recommendations without concordance or supporting evidence were observed. Only 10% of responses by LLMs were transparent, precluding users from fully interpreting the sources from which recommendations were provided.

CLINICAL RELEVANCE

Although leading LLMs generally provide recommendations concordant with CPGs, a substantial error rate exists, and the proportion of recommendations that do not align with these CPGs suggests that LLMs are not trustworthy clinical support tools at this time. Each off-the-shelf, closed-source LLM has strengths and weaknesses. Future research should evaluate and compare multiple LLMs to avoid bias associated with narrow evaluation of few models as observed in the current literature.
