
Lumbar disc herniation with radiculopathy: a comparison of NASS guidelines and ChatGPT.

Author Information

Kayastha Ankur, Lakshmanan Kirthika, Valentine Michael J, Nguyen Anh, Dholakia Kaushal, Wang Daniel

Affiliations

Kansas City University, Kansas City, MO, United States.

MedStar Health, Baltimore, MD, United States.

Publication Information

N Am Spine Soc J. 2024 Jun 1;19:100333. doi: 10.1016/j.xnsj.2024.100333. eCollection 2024 Sep.

Abstract

BACKGROUND

ChatGPT is an advanced language AI able to generate responses to clinical questions regarding lumbar disc herniation with radiculopathy. Artificial intelligence (AI) tools are increasingly being considered to assist clinicians in decision-making. This study compared ChatGPT-3.5 and ChatGPT-4.0 responses against established North American Spine Society (NASS) clinical guidelines and evaluated concordance.

METHODS

ChatGPT-3.5 and ChatGPT-4.0 were prompted with fifteen questions from the 2012 NASS Clinical Guidelines for the diagnosis and treatment of lumbar disc herniation with radiculopathy. Clinical questions, organized into categories, were entered into ChatGPT as unmodified queries. Language output was assessed by two independent authors on September 26, 2023 against the operationally defined parameters of accuracy, over-conclusiveness, supplementary information, and incompleteness. ChatGPT-3.5 and ChatGPT-4.0 performance was compared via chi-square analyses.

RESULTS

Among the 15 responses produced by ChatGPT-3.5, 7 (47%) were accurate, 7 (47%) were over-conclusive, 15 (100%) were supplementary, and 6 (40%) were incomplete. For ChatGPT-4.0, 10 (67%) were accurate, 5 (33%) were over-conclusive, 10 (67%) were supplementary, and 6 (40%) were incomplete. There was a statistically significant difference in supplementary information (100% vs. 67%; p=.014) between ChatGPT-3.5 and ChatGPT-4.0. Accuracy (47% vs. 67%; p=.269), over-conclusiveness (47% vs. 33%; p=.456), and incompleteness (40% vs. 40%; p=1.000) did not differ significantly between the two models. Both ChatGPT-3.5 and ChatGPT-4.0 yielded 100% accuracy in the definition and history-and-physical-examination categories. Diagnostic testing yielded 0% accuracy for ChatGPT-3.5 and 100% accuracy for ChatGPT-4.0. Nonsurgical interventions had 50% accuracy for ChatGPT-3.5 and 63% accuracy for ChatGPT-4.0. Surgical interventions resulted in 0% accuracy for ChatGPT-3.5 and 33% accuracy for ChatGPT-4.0.
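The abstract does not state which chi-square variant was used (with or without continuity correction). As a rough sanity check, the reported p-values can be reproduced with an uncorrected Pearson chi-square on the 2×2 counts above; the helper `chi2_2x2` below is an illustrative sketch, not code from the study, and uses only the Python standard library.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 table [[a, b], [c, d]]; returns (statistic, p) with df = 1."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Supplementary information: 15/15 (ChatGPT-3.5) vs. 10/15 (ChatGPT-4.0)
chi2_supp, p_supp = chi2_2x2(15, 0, 10, 5)   # chi2 = 6.0, p ~ 0.014

# Accuracy: 7/15 (ChatGPT-3.5) vs. 10/15 (ChatGPT-4.0)
chi2_acc, p_acc = chi2_2x2(7, 8, 10, 5)      # p ~ 0.269
```

With these counts the uncorrected test gives p ≈ .014 for supplementary information and p ≈ .269 for accuracy, matching the figures reported in the results; a Yates-corrected test would give a larger p for the supplementary comparison.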

CONCLUSIONS

ChatGPT-4.0 provided less supplementary information and higher overall accuracy across question categories than ChatGPT-3.5. ChatGPT showed reasonable concordance with NASS guidelines, but clinicians should exercise caution in using ChatGPT in its current state, as it fails to safeguard against misinformation.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0961/11261487/963828f64c4f/gr1.jpg
