Comparative Evaluation of Large Language and Multimodal Models in Detecting Spinal Stabilization Systems on X-Ray Images

Authors

Polis Bartosz, Zawadzka-Fabijan Agnieszka, Fabijan Robert, Kosińska Róża, Nowosławska Emilia, Fabijan Artur

Affiliations

Department of Neurosurgery, Polish-Mother's Memorial Hospital Research Institute, 93-338 Lodz, Poland.

Department of Rehabilitation Medicine, Faculty of Health Sciences, Medical University of Lodz, 90-419 Lodz, Poland.

Publication information

J Clin Med. 2025 May 8;14(10):3282. doi: 10.3390/jcm14103282.

DOI:10.3390/jcm14103282

PMID:40429276

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12112668/

Abstract

Open-source AI models are increasingly applied in medical imaging, yet their effectiveness in detecting and classifying spinal stabilization systems remains underexplored. This study compares ChatGPT-4o (a large language model) and BiomedCLIP (a multimodal model) in their analysis of posturographic X-ray images (AP projection) to assess their accuracy in identifying the presence, type (growing vs. non-growing), and specific system (MCGR vs. PSF). A dataset of 270 X-ray images (93 without stabilization, 80 with MCGR, and 97 with PSF) was analyzed manually by neurosurgeons and evaluated using a three-stage AI-based questioning approach. Performance was assessed via classification accuracy, Gwet's Agreement Coefficient (AC1) for inter-rater reliability, and a two-tailed z-test for statistical significance (p < 0.05). The results indicate that GPT-4o demonstrates high accuracy in detecting spinal stabilization systems, achieving near-perfect recognition (97-100%) for the presence or absence of stabilization. However, its consistency is reduced when distinguishing complex growing-rod (MCGR) configurations, with agreement scores dropping significantly (AC1 = 0.32-0.50). In contrast, BiomedCLIP displays greater response consistency (AC1 = 1.00) but struggles with detailed classification, particularly in recognizing PSF (11% accuracy) and MCGR (4.16% accuracy). Sensitivity analysis revealed GPT-4o's superior stability in hierarchical classification tasks, while BiomedCLIP excelled in binary detection but showed performance deterioration as the classification complexity increased. These findings highlight GPT-4o's robustness in clinical AI-assisted diagnostics, particularly for detailed differentiation of spinal stabilization systems, whereas BiomedCLIP's precision may require further optimization to enhance its applicability in complex radiographic evaluations.
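The abstract names two statistics, Gwet's AC1 and a two-tailed z-test, without giving formulas. The following is a minimal Python sketch of how such quantities are commonly computed, not the authors' actual code. It assumes the two-rater form of AC1 (for example, two repeated responses per image) and a pooled two-proportion z-test; the abstract does not state which variants were used. The function names `gwet_ac1` and `two_proportion_z_test` and all sample inputs are hypothetical and are not taken from the study data.

```python
from collections import Counter
from math import erfc, sqrt


def gwet_ac1(ratings_a, ratings_b, categories):
    """Gwet's AC1 for two raters (or two repeated runs) over the same items.

    Assumption: two-rater form; `categories` lists every possible label,
    including labels that never occur in the ratings.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")
    n = len(ratings_a)
    # Observed agreement: share of items given the same label by both raters.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # pi_k: average share of ratings falling in category k across both raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = [counts[c] / (2 * n) for c in categories]
    # Chance agreement as defined for AC1.
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-tailed z-test for a difference between two proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p1 - p2) / se
    # Two-tailed p-value from the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Illustrative labels only, not study data.
    labels = ["present", "absent"]
    run_1 = ["present", "present", "absent", "absent"]
    run_2 = ["present", "absent", "absent", "present"]
    print(gwet_ac1(run_1, run_1, labels))  # identical runs give full agreement
    print(gwet_ac1(run_1, run_2, labels))
    print(two_proportion_z_test(90, 100, 50, 100))
```

Under this definition, a model that returns the same answer on every repetition yields AC1 = 1.00 regardless of whether the answer is correct, which is why high consistency and low accuracy can coexist.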
