Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Chicago Medical School at Rosalind Franklin University, North Chicago, IL, USA.
Eur Spine J. 2024 Nov;33(11):4182-4203. doi: 10.1007/s00586-024-08198-6. Epub 2024 Mar 15.
BACKGROUND CONTEXT: Clinical guidelines, developed in concordance with the literature, are often used to guide surgeons' clinical decision-making. Recent advances in large language models and artificial intelligence (AI) in the medical field come with exciting potential. OpenAI's generative AI model, known as ChatGPT, can quickly synthesize information and generate responses grounded in medical literature, which may prove to be a useful tool in clinical decision-making for spine care. The current literature has yet to investigate the ability of ChatGPT to assist clinical decision-making with regard to degenerative spondylolisthesis.
PURPOSE: The study aimed to compare ChatGPT's concordance with the recommendations set forth by The North American Spine Society (NASS) Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis and to assess ChatGPT's accuracy within the context of the most recent literature.
METHODS: ChatGPT-3.5 and ChatGPT-4.0 were prompted with questions from the NASS Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis, and their recommendations were graded as "concordant" or "nonconcordant" relative to those put forth by NASS. A response was considered "concordant" when ChatGPT generated a recommendation that accurately reproduced all major points made in the NASS recommendation. Any response graded "nonconcordant" was further stratified into one of two subcategories, "insufficient" or "over-conclusive," to provide further insight into the grading rationale. Responses from GPT-3.5 and GPT-4.0 were compared using Chi-squared tests.
RESULTS: ChatGPT-3.5 answered 13 of NASS's 28 total clinical questions in concordance with NASS's guidelines (46.4%). The categorical breakdown is as follows: Definitions and Natural History (1/1, 100%), Diagnosis and Imaging (1/4, 25%), Outcome Measures for Medical Intervention and Surgical Treatment (0/1, 0%), Medical and Interventional Treatment (4/6, 66.7%), Surgical Treatment (7/14, 50%), and Value of Spine Care (0/2, 0%). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-3.5 generated a concordant response 66.7% of the time (6/9). However, ChatGPT-3.5's concordance dropped to 36.8% for clinical questions on which NASS did not provide a clear recommendation (7/19). A further breakdown of ChatGPT-3.5's nonconcordance with the guidelines revealed that the vast majority of its inaccurate recommendations were "over-conclusive" (12/15, 80%) rather than "insufficient" (3/15, 20%). ChatGPT-4.0 answered 19 (67.9%) of the 28 total questions in concordance with NASS guidelines (P = 0.177). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-4.0 generated a concordant response 66.7% of the time (6/9). ChatGPT-4.0's concordance held at 68.4% for clinical questions on which NASS did not provide a clear recommendation (13/19, P = 0.104).
CONCLUSIONS: This study sheds light on the duality of LLM applications within clinical settings: accuracy and utility in some contexts versus inaccuracy and risk in others. ChatGPT was concordant for most clinical questions for which NASS offered recommendations. However, for questions on which NASS did not offer best practices, ChatGPT generated answers that were either too general or inconsistent with the literature, and at times fabricated data and citations. Thus, clinicians should exercise extreme caution when consulting ChatGPT for clinical recommendations, taking care to verify its reliability against the recent literature.
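The two P values reported in RESULTS (0.177 overall, 0.104 for the no-clear-recommendation subset) can be reproduced from the concordance counts given in the abstract using a 2x2 chi-squared test with Yates' continuity correction. The Python/SciPy sketch below is an illustrative reconstruction under that assumption, not the authors' analysis code.

```python
from scipy.stats import chi2_contingency

# Overall concordance with NASS guidelines (counts taken from the abstract):
# GPT-3.5: 13 concordant / 15 nonconcordant; GPT-4.0: 19 concordant / 9 nonconcordant
overall = [[13, 15], [19, 9]]
chi2, p, dof, expected = chi2_contingency(overall)  # Yates' correction is applied by default for 2x2 tables
print(f"Overall comparison: chi2 = {chi2:.3f}, p = {p:.3f}")  # p ~= 0.177

# Questions where NASS gave no clear recommendation:
# GPT-3.5: 7/19 concordant; GPT-4.0: 13/19 concordant
no_clear_rec = [[7, 12], [13, 6]]
chi2, p, dof, expected = chi2_contingency(no_clear_rec)
print(f"No-clear-recommendation subset: chi2 = {chi2:.3f}, p = {p:.3f}")  # p ~= 0.104
```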