Almekkawi Ahmad K, Caruso James P, Anand Soummitra, Hawkins Angela M, Rauf Rayaan, Al-Shaikhli Mayar, Aoun Salah G, Bagley Carlos A
Saint Luke's Marion Bloch Neuroscience Institute Department of Neurosurgery, Kansas City, Missouri, USA.
The University of Texas Southwestern Department of Neurosurgery, Dallas, Texas, USA.
World Neurosurg. 2025 Feb;194:123531. doi: 10.1016/j.wneu.2024.11.114. Epub 2024 Dec 23.
This study aimed to investigate the accuracy of large language models (LLMs), specifically ChatGPT and Claude, in surgical decision-making and radiological assessment for spine pathologies compared to experienced spine surgeons.
The study employed a comparative analysis between the LLMs and a panel of attending spine surgeons. Five written clinical scenarios encompassing various spine pathologies were presented to the LLMs and the surgeons, each of whom provided a recommended surgical treatment plan. Additionally, the LLMs and surgeons analyzed magnetic resonance imaging (MRI) scans depicting spine pathologies to assess their radiological interpretation abilities. The LLMs also estimated spino-pelvic parameters from a scoliosis radiograph.
Qualitative content analysis revealed limitations in the LLMs' consideration of patient-specific factors and in the breadth of treatment options they proposed. Both ChatGPT and Claude provided detailed descriptions of MRI findings, but their identification of the specific spinal levels and the severity of pathologies differed from the surgeons' assessments. The LLMs acknowledged that spino-pelvic parameters could not be accurately measured without specialized tools. The surgical decision-making accuracy of the LLMs (20%) was lower than that of the attending surgeons (100%), although statistical analysis showed no significant difference in accuracy between the groups.
The study highlights the potential of LLMs to assist with radiological interpretation and surgical decision-making in spine surgery. However, their current limitations, including insufficient consideration of patient-specific factors and inaccurate treatment recommendations, underscore the need for further refinement and validation of these artificial intelligence (AI) models. Continued collaboration between AI researchers and clinical experts is essential to address these challenges and realize the full potential of AI in spine surgery.