Nwachukwu Benedict U, Varady Nathan H, Allen Answorth A, Dines Joshua S, Altchek David W, Williams Riley J, Kunze Kyle N
Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
Arthroscopy. 2025 Feb;41(2):263-275.e6. doi: 10.1016/j.arthro.2024.07.040. Epub 2024 Aug 22.
To determine whether several leading, commercially available large language models (LLMs) provide treatment recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS).
All AAOS CPGs concerning the management of rotator cuff tears (n = 33) and anterior cruciate ligament injuries (n = 15) were extracted. Treatment recommendations from Chat Generative Pre-trained Transformer version 4 (ChatGPT-4), Gemini, Mistral-7B, and Claude-3 were graded by 2 blinded physicians as concordant, discordant, or indeterminate (i.e., a neutral response without a definitive recommendation) with respect to the AAOS CPGs. Overall concordance between LLM and AAOS recommendations was quantified, and concordance was compared among the 4 LLMs with the Fisher exact test.
Overall, 135 responses (70.3%) were concordant, 43 (22.4%) were indeterminate, and 14 (7.3%) were discordant. Inter-rater reliability for concordance classification was excellent (κ = 0.92). Concordance with AAOS CPGs was most frequently observed with ChatGPT-4 (n = 38, 79.2%) and least frequently observed with Mistral-7B (n = 28, 58.3%). Indeterminate recommendations were most frequently observed with Mistral-7B (n = 17, 35.4%) and least frequently observed with Claude-3 (n = 8, 16.7%). Discordant recommendations were most frequently observed with Gemini (n = 6, 12.5%) and least frequently observed with ChatGPT-4 (n = 1, 2.1%). Overall, no statistically significant difference in concordant recommendations was observed across LLMs (P = .12). Of all recommendations, only 20 (10.4%) were transparent, providing references with full bibliographic details or links to specific peer-reviewed content in support of the recommendation.
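The Fisher exact test used above can be illustrated with a pairwise comparison reconstructed from the reported counts (38/48 concordant for ChatGPT-4 vs. 28/48 for Mistral-7B); note that the study itself reports a single overall comparison across all four models (P = .12), so this 2 × 2 sketch is illustrative only. A minimal, stdlib-only implementation:

```python
# Two-sided Fisher exact test on a 2x2 table, computed from the
# hypergeometric distribution with fixed margins (stdlib only).
# Counts below are taken from the reported results: 48 recommendations
# per model; ChatGPT-4 38 concordant, Mistral-7B 28 concordant.
from math import comb

def hypergeom_p(a, row1, row2, col1):
    """P(X = a) for a 2x2 table with row sums row1, row2 and first column sum col1."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)

def fisher_exact_2x2(a, b, c, d):
    """Two-sided p-value: sum of all table probabilities <= that of the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    p_obs = hypergeom_p(a, row1, row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(p for p in (hypergeom_p(x, row1, row2, col1) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# ChatGPT-4: 38 concordant / 10 not; Mistral-7B: 28 concordant / 20 not
p = fisher_exact_2x2(38, 10, 28, 20)
print(f"pairwise two-sided p = {p:.4f}")
```

The overall test in the study would instead use the full 4 × 2 table of concordant versus non-concordant counts across all four LLMs, for which an exact test requires enumerating tables with fixed margins (as in R's `fisher.test` or SciPy's `fisher_exact`).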
Among leading commercially available LLMs, more than 1 in 4 recommendations concerning the evaluation and management of rotator cuff and anterior cruciate ligament injuries did not reflect current evidence-based CPGs. Although ChatGPT-4 showed the highest performance, clinically significant rates of discordant or unsupported recommendations were still observed. Only 10% of LLM responses were transparent, precluding users from fully evaluating the sources from which recommendations were derived.
Although leading LLMs generally provide recommendations concordant with CPGs, a substantial error rate exists, and the proportion of recommendations that do not align with these CPGs suggests that LLMs are not trustworthy clinical support tools at this time. Each off-the-shelf, closed-source LLM has distinct strengths and weaknesses. Future research should evaluate and compare multiple LLMs to avoid the bias associated with narrow evaluation of a few models, as observed in the current literature.