Polis Bartosz, Zawadzka-Fabijan Agnieszka, Fabijan Robert, Kosińska Róża, Nowosławska Emilia, Fabijan Artur
Department of Neurosurgery, Polish-Mother's Memorial Hospital Research Institute, 93-338 Lodz, Poland.
Department of Rehabilitation Medicine, Faculty of Health Sciences, Medical University of Lodz, 90-419 Lodz, Poland.
J Clin Med. 2025 May 8;14(10):3282. doi: 10.3390/jcm14103282.
Open-source AI models are increasingly applied in medical imaging, yet their effectiveness in detecting and classifying spinal stabilization systems remains underexplored. This study compares ChatGPT-4o (a large language model) and BiomedCLIP (a multimodal model) in the analysis of posturographic X-ray images (AP projection), assessing their accuracy in identifying the presence, type (growing vs. non-growing), and specific system (MCGR vs. PSF) of spinal stabilization. A dataset of 270 X-ray images (93 without stabilization, 80 with MCGR, and 97 with PSF) was analyzed manually by neurosurgeons and evaluated using a three-stage AI-based questioning approach. Performance was assessed via classification accuracy, Gwet's Agreement Coefficient (AC1) for inter-rater reliability, and a two-tailed z-test for statistical significance (p < 0.05). The results indicate that GPT-4o demonstrates high accuracy in detecting spinal stabilization systems, achieving near-perfect recognition (97-100%) for the presence or absence of stabilization. However, its consistency is reduced when distinguishing complex growing-rod (MCGR) configurations, with agreement scores dropping significantly (AC1 = 0.32-0.50). In contrast, BiomedCLIP displays greater response consistency (AC1 = 1.00) but struggles with detailed classification, particularly in recognizing PSF (11% accuracy) and MCGR (4.16% accuracy). Sensitivity analysis revealed GPT-4o's superior stability in hierarchical classification tasks, while BiomedCLIP excelled in binary detection but showed performance deterioration as classification complexity increased. These findings highlight GPT-4o's robustness in clinical AI-assisted diagnostics, particularly for detailed differentiation of spinal stabilization systems, whereas BiomedCLIP's precision may require further optimization to enhance its applicability in complex radiographic evaluations.
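For readers unfamiliar with the two statistics named above, the following is a minimal Python sketch (not the authors' code) of Gwet's AC1 for two raters and a two-tailed z-test comparing two classification accuracies. All ratings and counts in the example are illustrative placeholders, not data from the study.

```python
# Minimal sketch of Gwet's AC1 (two raters, nominal categories) and a
# two-tailed two-proportion z-test. Example inputs are hypothetical.
from collections import Counter
from math import sqrt, erf

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 agreement coefficient for two raters over the same items."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    # Observed agreement: fraction of items both raters labelled identically.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: based on the average marginal proportion per category.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    pi = {c: (counts_a[c] + counts_b[c]) / (2 * n) for c in categories}
    p_e = sum(pi[c] * (1 - pi[c]) for c in categories) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed z-test for the difference of two proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 1 - erf(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Illustrative use: agreement between two repeated model runs, and a comparison
# of detection accuracy between two models on 270 images (numbers hypothetical).
run1 = ["PSF", "MCGR", "none", "PSF", "MCGR"]
run2 = ["PSF", "MCGR", "none", "PSF", "PSF"]
print(f"AC1 = {gwet_ac1(run1, run2):.2f}")
z, p = two_proportion_z_test(262, 270, 245, 270)
print(f"z = {z:.2f}, p = {p:.4f}")
```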