Mehmet Saylan, Elmarawany Mohamed Nabil, Harding Ian, Bowey Andrew James, Andrews John, Chan Daniel, Jayasuriya Raveen, Srinivas Shreya, Tomlinson James, Bayley Edward, Grevitt Michael Paul, James Stuart, Jones Alwyn, McCarthy Michael J H
Cardiff University, Cardiff, UK.
North Bristol NHS Trust, Bristol, UK.
Eur Spine J. 2025 Apr 3. doi: 10.1007/s00586-025-08825-w.
The use of artificial intelligence (AI) in spinal surgery is expanding, yet its ability to match the diagnostic and treatment planning accuracy of human surgeons remains unclear. This study aims to compare the performance of AI models (ChatGPT-3.5, ChatGPT-4, and Google Bard) with that of experienced spinal surgeons in controversial spinal scenarios.
A questionnaire comprising 54 questions was presented to ten spinal surgeons on two occasions, four weeks apart, to assess consistency. The same questionnaire was also presented to ChatGPT-3.5, ChatGPT-4, and Google Bard, each generating five responses per question. Responses were analyzed for consistency and agreement with human surgeons using Kappa values. Thematic analysis of AI responses identified common themes and evaluated the depth and accuracy of AI recommendations.
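The abstract does not specify which Kappa statistic was used; a common choice for two raters with categorical responses is Cohen's Kappa, which corrects observed agreement for the agreement expected by chance. A minimal sketch of that calculation, assuming unweighted Cohen's Kappa over categorical answers (the function name and example labels are illustrative, not from the study):

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's Kappa for two raters over categorical items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is chance agreement from each rater's marginal frequencies.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be equal-length, non-empty sequences")
    n = len(ratings_a)
    # Observed proportion of items where the two raters agree.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in categories)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters constant and identical
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: two raters answering six categorical questions.
surgeon = ["A", "A", "B", "B", "A", "B"]
model   = ["A", "B", "B", "B", "A", "A"]
print(round(cohen_kappa(surgeon, model), 3))
```

By the conventional Landis and Koch scale, values of 0.21 to 0.40 are "fair" agreement, which is how the study characterizes the Kappa values of 0.24 and 0.27 reported below.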
Test-retest reliability among surgeons showed Kappa values from 0.535 to 1.00, indicating moderate to perfect reliability. Inter-rater agreement between surgeons and AI models was generally low, with nonsignificant p-values. Fair agreement was observed between the surgeons' second-occasion responses and ChatGPT-3.5 (Kappa = 0.24) and ChatGPT-4 (Kappa = 0.27). AI responses were detailed and structured, while surgeons provided more concise answers.
AI large language models are not yet suitable for complex spinal surgery decisions but hold potential for preliminary information gathering and emergency triage. Legal, ethical, and accuracy issues must be addressed before AI can be reliably integrated into clinical practice.