GPT4LFS (generative pretrained transformer 4 omni for lumbar foramina stenosis): enhancing lumbar foraminal stenosis image classification through large multimodal models.

AUTHOR INFORMATION

Yilihamu Elzat Elham-Yilizati, Zeng Fan-Shuo, Shang Jun, Yang Jin-Tao, Zhong Hai, Feng Shi-Qing

AFFILIATIONS

Orthopedic Research Center of Shandong University & Advanced Medical Research Institute, Shandong University, Jinan 250000, China.

Department of Rehabilitation of the Second Hospital of Shandong University, Cheeloo College of Medicine, Shandong University, Jinan 250000, China.

PUBLICATION INFORMATION

Spine J. 2025 Mar 27. doi: 10.1016/j.spinee.2025.03.011.

BACKGROUND CONTEXT

Lumbar foraminal stenosis (LFS) is a common spinal condition that requires accurate assessment. Current magnetic resonance imaging (MRI) reporting processes are often inefficient, and while deep learning has potential for improvement, challenges in generalization and interpretability limit its diagnostic effectiveness compared to physician expertise.

PURPOSE

The present study aimed to leverage a multimodal large language model to improve the accuracy and efficiency of LFS image classification, thereby enabling rapid and precise automated diagnosis and reducing dependence on manually annotated data.

STUDY DESIGN/SETTING

Retrospective study conducted from April 2017 to March 2023.

PATIENT SAMPLE

Sagittal T1-weighted MRI data for the lumbar spine were collected from 1,200 patients across 3 medical centers. A total of 810 patient cases were included in the final analysis, with data collected from 7 different MRI devices.

OUTCOME MEASURES

Automated classification of LFS using the multimodal large language model. Accuracy, sensitivity, specificity, and Cohen's Kappa coefficient were calculated.
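The four outcome measures can be computed from paired reference and predicted labels. The following is a minimal sketch, not the authors' evaluation code: the function name `classification_metrics` is hypothetical, and macro-averaging sensitivity and specificity over classes (one-vs-rest) is an assumption, since the abstract does not state how multi-class values were aggregated.

```python
from collections import Counter


def _mean(values):
    """Arithmetic mean; NaN when no class contributes a value."""
    return sum(values) / len(values) if values else float("nan")


def classification_metrics(y_true, y_pred):
    """Accuracy, macro sensitivity/specificity, and Cohen's kappa.

    Hypothetical helper: per-class values are computed one-vs-rest and
    macro-averaged, which is an assumption, not the paper's stated method.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count

    # Observed agreement: fraction of samples on the confusion-matrix diagonal.
    accuracy = sum(pairs[(c, c)] for c in labels) / n

    sensitivities, specificities = [], []
    for c in labels:
        tp = pairs[(c, c)]
        fn = sum(pairs[(c, p)] for p in labels if p != c)
        fp = sum(pairs[(t, c)] for t in labels if t != c)
        tn = n - tp - fn - fp
        if tp + fn:  # class has at least one positive sample
            sensitivities.append(tp / (tp + fn))
        if tn + fp:  # class has at least one negative sample
            specificities.append(tn / (tn + fp))

    # Chance agreement from the marginal label frequencies.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    kappa = 1.0 if p_e == 1 else (accuracy - p_e) / (1 - p_e)

    return {
        "accuracy": accuracy,
        "sensitivity": _mean(sensitivities),
        "specificity": _mean(specificities),
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Toy labels for illustration only; these are not study data.
    print(classification_metrics([0, 0, 1, 1], [0, 1, 1, 1]))
```

Kappa corrects the observed agreement for the agreement expected by chance, which is why it is reported alongside accuracy when class frequencies are unbalanced.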

METHODS

GPT4LFS, a multimodal fusion framework, was developed to integrate imaging data with natural language descriptions and thereby capture the complex features of LFS. A pretrained ConvNeXt served as the image-processing module and extracted high-dimensional imaging features. In parallel, descriptive medical texts were generated by the multimodal large language model GPT-4o and encoded with RoBERTa to strengthen the model's contextual understanding. The Mamba architecture was used in the feature-fusion stage to integrate the imaging and textual features and improve classification performance. Finally, the stability of the model's results was validated using classification metrics, including accuracy, sensitivity, specificity, and the Kappa coefficient.
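The data flow described above (image encoder, text encoder, sequence-style fusion, classifier) can be sketched in plain Python. This is a structural illustration only and not the GPT4LFS implementation: `FusionPipeline`, `scan_fusion`, and `argmax` are hypothetical names, the encoders are placeholders for ConvNeXt and RoBERTa, and the fusion step is a toy fixed-decay linear recurrence. Mamba itself uses input-dependent (selective) state-space parameters, which this sketch does not reproduce. Both feature vectors are assumed to have already been projected to the same dimension.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]


def scan_fusion(tokens: Sequence[Vector], decay: float = 0.5) -> Vector:
    """Toy linear recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t.

    Scans the feature tokens in order and returns the final state.
    A fixed-decay stand-in for a state-space fusion block, not Mamba.
    """
    state = [0.0] * len(tokens[0])
    for token in tokens:
        state = [decay * h + (1.0 - decay) * x for h, x in zip(state, token)]
    return state


def argmax(scores: Vector) -> int:
    """Index of the largest score, used here as a stand-in classifier head."""
    return max(range(len(scores)), key=scores.__getitem__)


@dataclass
class FusionPipeline:
    """Hypothetical container wiring the four stages together."""

    image_encoder: Callable[[object], Vector]  # placeholder for a pretrained ConvNeXt
    text_encoder: Callable[[str], Vector]      # placeholder for a RoBERTa encoder
    fusion: Callable[[Sequence[Vector]], Vector] = scan_fusion
    classifier: Callable[[Vector], int] = argmax

    def predict(self, image: object, description: str) -> int:
        image_features = self.image_encoder(image)
        text_features = self.text_encoder(description)
        # Treat the two modalities as a two-token sequence for the fusion scan.
        fused = self.fusion([image_features, text_features])
        return self.classifier(fused)


if __name__ == "__main__":
    # Constant encoders stand in for the real models; values are illustrative.
    pipeline = FusionPipeline(
        image_encoder=lambda image: [1.0, 0.0],
        text_encoder=lambda text: [0.0, 1.0],
    )
    print(pipeline.predict(image=None, description="placeholder description"))
```

Keeping each stage behind a plain callable makes it straightforward to swap a placeholder for a real encoder without changing the rest of the pipeline.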

RESULTS

The training set comprised 6,299 images from 635 patients, the internal test set 820 images from 82 patients, and the external test set 930 images from 93 patients. On the internal test set, GPT4LFS achieved an overall accuracy of 93.7%, a sensitivity of 95.8%, and a specificity of 94.5% (Kappa=0.89, 95% confidence interval (CI): 0.84-0.96, p<.001). On the external test set, the overall accuracy was 92.2%, with a sensitivity of 92.2% and a specificity of 97.4% (Kappa=0.88, 95% CI: 0.84-0.89, p<.001). The model showed excellent consistency on both the internal and external test sets. The code is freely available on GitHub at https://github.com/ElzatElham/GPT4LFS.
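The abstract reports 95% CIs for Kappa without stating how they were obtained. One common option is a percentile bootstrap, sketched below under that assumption; `cohen_kappa` and `bootstrap_kappa_ci` are hypothetical helper names, and the labels in the usage section are synthetic, not study data.

```python
import random
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # p_e == 1 only when both label lists contain a single identical class.
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


def bootstrap_kappa_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa (an assumed method, see lead-in)."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample (true, predicted) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohen_kappa([y_true[i] for i in idx],
                                 [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Synthetic example: 100 samples with 10 disagreements.
    y_true = [0] * 50 + [1] * 50
    y_pred = [0] * 45 + [1] * 5 + [0] * 5 + [1] * 45
    print(cohen_kappa(y_true, y_pred))
    print(bootstrap_kappa_ci(y_true, y_pred))
```

Because each patient contributes several images, image-level resampling treats correlated samples as independent; resampling whole patients instead would be a more conservative variant.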

CONCLUSIONS

In LFS image classification, the GPT4LFS model demonstrated accuracy and a capacity for feature description commensurate with those of professional clinicians.
