Evaluating the Adherence of Large Language Models to Surgical Guidelines: A Comparative Analysis of Chatbot Recommendations and North American Spine Society (NASS) Coverage Criteria.

Authors

Sarikonda Advith, Isch Emily, Self Mitchell, Sambangi Abhijeet, Carreras Angeleah, Sivaganesan Ahilan, Harrop Jim, Jallo Jack

Affiliations

Department of Neurological Surgery, Thomas Jefferson University, Philadelphia, USA.

Department of General Surgery, Division of Plastic Surgery, Thomas Jefferson University Hospital, Philadelphia, USA.

Publication information

Cureus. 2024 Sep 3;16(9):e68521. doi: 10.7759/cureus.68521. eCollection 2024 Sep.

DOI:10.7759/cureus.68521

PMID:39364514

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11448007/

Abstract

Background

There has been a significant increase in cervical fusion procedures, both anterior and posterior, across the United States. Despite this upward trend, limited research exists on adherence to evidence-based medicine (EBM) guidelines for cervical fusion, highlighting a gap between recommended practices and surgeon preferences. Additionally, patients are increasingly utilizing large language models (LLMs) to aid in decision-making.

Methodology

This observational study evaluated the capacity of four LLMs, namely Bard, BingAI (Bing Chat), ChatGPT-3.5, and ChatGPT-4, to adhere to EBM guidelines, specifically the 2023 North American Spine Society (NASS) cervical fusion guidelines. Ten clinical vignettes were created based on NASS recommendations to determine when fusion was indicated. This novel approach assessed LLM performance in a clinical decision-making context without requiring institutional review board approval, as no human subjects were involved.

Results

No LLM achieved complete concordance with NASS guidelines, though ChatGPT-4 and Bing Chat exhibited the highest adherence at 60%. Discrepancies were most notable in scenarios involving head-drop syndrome and pseudoarthrosis, where all LLMs failed to align with NASS recommendations. Additionally, only 25% of the LLMs (one of four) agreed with NASS guidelines for fusion in cases of cervical radiculopathy and as an adjunct to facet cyst resection.

Conclusions

The study underscores the need for improved LLM training on clinical guidelines and emphasizes the importance of considering the nuances of individual patient cases. While LLMs hold promise for enhancing guideline adherence in cervical fusion decision-making, their current performance indicates a need for further refinement and integration with clinical expertise to ensure optimal patient care. This study contributes to understanding the role of AI in healthcare, advocating for a balanced approach that leverages technological advancements while acknowledging the complexities of surgical decision-making.
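The percentages in the Results are simple proportions: a per-model concordance rate (vignettes matching the guideline divided by total vignettes) and a per-vignette agreement rate (models matching the guideline divided by total models). The sketch below illustrates that arithmetic only. The function names, the model labels `model_a` to `model_d`, and every answer list are hypothetical placeholders, not the study's vignettes, guideline determinations, or chatbot responses.

```python
def concordance_rate(model_answers, guideline_answers):
    """Fraction of vignettes where a model's fusion recommendation matches the guideline."""
    if len(model_answers) != len(guideline_answers):
        raise ValueError("answer lists must have the same length")
    matches = sum(m == g for m, g in zip(model_answers, guideline_answers))
    return matches / len(guideline_answers)


def vignette_agreement(answers_by_model, guideline_answers, index):
    """Fraction of models whose answer for one vignette matches the guideline."""
    agreeing = sum(
        answers[index] == guideline_answers[index]
        for answers in answers_by_model.values()
    )
    return agreeing / len(answers_by_model)


# Hypothetical placeholder data (NOT taken from the study):
# True means "fusion indicated", False means "fusion not indicated".
guideline = [True, True, False, True, False, True, True, False, True, True]
answers_by_model = {
    "model_a": [True, True, False, True, False, True, False, True, False, False],
    "model_b": [True, False, False, True, True, True, True, False, False, True],
    "model_c": [False, True, True, True, False, False, False, False, True, False],
    "model_d": [True, True, False, False, False, True, False, True, True, True],
}

rate_a = concordance_rate(answers_by_model["model_a"], guideline)
agreement_v7 = vignette_agreement(answers_by_model, guideline, 6)
print(f"model_a concordance: {rate_a:.0%}")
print(f"models agreeing on vignette 7: {agreement_v7:.0%}")
```

With these placeholder lists, `model_a` matches 6 of 10 vignettes and one of four models matches the guideline on the seventh vignette, which mirrors how figures such as 60% and 25% arise from ten vignettes and four models.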

Similar articles

ChatGPT versus NASS clinical guidelines for degenerative spondylolisthesis: a comparative analysis.

Eur Spine J. 2024 Nov;33(11):4182-4203. doi: 10.1007/s00586-024-08198-6. Epub 2024 Mar 15.

An analysis of ChatGPT recommendations for the diagnosis and treatment of cervical radiculopathy.

J Neurosurg Spine. 2024 Jun 28;41(3):385-395. doi: 10.3171/2024.4.SPINE231148. Print 2024 Sep 1.

Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations.

Surg Obes Relat Dis. 2024 Jul;20(7):603-608. doi: 10.1016/j.soard.2024.03.011. Epub 2024 Mar 24.

Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study.

J Med Internet Res. 2023 Dec 28;25:e51580. doi: 10.2196/51580.

Use of ChatGPT for Determining Clinical and Surgical Treatment of Lumbar Disc Herniation With Radiculopathy: A North American Spine Society Guideline Comparison.

Neurospine. 2024 Mar;21(1):149-158. doi: 10.14245/ns.2347052.526. Epub 2024 Jan 31.

Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study.

J Med Internet Res. 2024 Jun 14;26:e53297. doi: 10.2196/53297.

Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology.

Cureus. 2023 Aug 4;15(8):e42972. doi: 10.7759/cureus.42972. eCollection 2023 Aug.

Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.

Eur J Orthod. 2024 Apr 13. doi: 10.1093/ejo/cjae017.

The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review.

JMIR Med Inform. 2024 May 10;12:e53787. doi: 10.2196/53787.

Cited by

Current trends and future prospects of language models and processing systems in spine surgery - a scoping review.

Neurosurg Rev. 2025 Sep 5;48(1):633. doi: 10.1007/s10143-025-03785-7.

Decoding Immunodeficiencies with Artificial Intelligence: A New Era of Precision Medicine.

Biomedicines. 2025 Jul 28;13(8):1836. doi: 10.3390/biomedicines13081836.

Evaluation of the performance of large language models in endoscopic lumbar surgery: a comparative analysis.

Ann Med Surg (Lond). 2025 Jun 30;87(8):4835-4840. doi: 10.1097/MS9.0000000000003519. eCollection 2025 Aug.

Evaluating large language model performance to support the diagnosis and management of patients with primary immune disorders.

J Allergy Clin Immunol. 2025 Feb 14. doi: 10.1016/j.jaci.2025.02.004.

Large language models in neurosurgery: a systematic review and meta-analysis.

Acta Neurochir (Wien). 2024 Nov 23;166(1):475. doi: 10.1007/s00701-024-06372-9.

Assessing the Clinical Appropriateness and Practical Utility of ChatGPT as an Educational Resource for Patients Considering Minimally Invasive Spine Surgery.

Cureus. 2024 Oct 8;16(10):e71105. doi: 10.7759/cureus.71105. eCollection 2024 Oct.
