Aydin Cemre, Duygu Ozden Bedre, Karakas Asli Beril, Er Eda, Gokmen Gokhan, Ozturk Anil Murat, Govsa Figen
Department of Orthopedics and Traumatology, Faculty of Medicine, Ege University, 35040 Izmir, Turkey.
Department of Anatomy, Faculty of Medicine, Bakırcay University, 35660 Izmir, Turkey.
Medicina (Kaunas). 2025 Jul 25;61(8):1342. doi: 10.3390/medicina61081342.
General-purpose multimodal large language models (LLMs) are increasingly used for medical image interpretation despite lacking clinical validation. This study evaluates the diagnostic reliability of ChatGPT-4o and Claude 2 in the photographic assessment of adolescent idiopathic scoliosis (AIS) against radiological standards. It examines two critical questions: whether families can derive reliable preliminary assessments from LLMs through analysis of clinical photographs, and whether LLMs exhibit cognitive fidelity in their visuospatial reasoning for AIS assessment. A prospective, STARD-compliant diagnostic accuracy study analyzed 97 adolescents (74 with AIS and 23 with postural asymmetry). Standardized clinical photographs (nine views per patient) were assessed by two LLMs and two orthopedic residents against reference radiological measurements. Primary outcomes included diagnostic accuracy (sensitivity/specificity), Cobb angle concordance (Lin's CCC), inter-rater reliability (Cohen's κ), and measurement agreement (Bland-Altman limits of agreement, LoA). The LLMs exhibited hazardous diagnostic inaccuracy: ChatGPT misclassified all non-AIS cases (specificity 0% [95% CI: 0.0-14.8]), while Claude 2 produced a 78.3% false-positive rate. Systematic measurement errors exceeded clinical tolerance: ChatGPT overestimated thoracic curves by +10.74° (LoA: -21.45° to +42.92°), exceeding tolerance by more than 800%. Both LLMs showed inverse biomechanical concordance for thoracolumbar curves (CCC ≤ -0.106). Inter-rater reliability fell below random chance (ChatGPT κ = -0.039). Universal proportional bias (slopes ≈ -1.0) caused severe underestimation of larger curves (e.g., 10-15° error for 50° deformities). Human evaluators demonstrated superior bias control (0.3-2.8° vs. 2.6-10.7°) but suboptimal specificity (21.7-26.1%) and hazardous lumbar concordance (CCC: -0.123). General-purpose LLMs demonstrate clinically unacceptable inaccuracy in photographic AIS assessment, contraindicating clinical deployment.
Catastrophic false positives, systematic measurement errors exceeding tolerance by 480-1074%, and inverse diagnostic concordance necessitate urgent regulatory safeguards under frameworks like the EU AI Act. Neither LLMs nor photographic human assessment achieve reliability thresholds for standalone screening, mandating domain-specific algorithm development and integration of 3D modalities.
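The abstract's agreement statistics (Lin's CCC and Bland-Altman limits of agreement) can be illustrated with a minimal sketch. This is not the study's analysis code, and the paired Cobb angle values below are synthetic, chosen only to show how a concordance coefficient and limits of agreement are computed from reference versus rater measurements:

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements.

    CCC = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2),
    using population (ddof=0) variances and covariance.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    mx, my = x.mean(), y.mean()
    sx2, sy2 = x.var(), y.var()          # population variances
    sxy = ((x - mx) * (y - my)).mean()   # population covariance
    return 2 * sxy / (sx2 + sy2 + (mx - my) ** 2)

def bland_altman_loa(x, y):
    """Mean bias and 95% limits of agreement (bias +/- 1.96 SD of differences)."""
    d = np.asarray(x, float) - np.asarray(y, float)
    bias, sd = d.mean(), d.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Illustrative (synthetic) paired Cobb angles in degrees:
# radiographic reference vs. a hypothetical photographic rater.
ref   = np.array([12, 18, 25, 32, 40, 48, 55], float)
rater = np.array([20, 27, 33, 38, 44, 50, 54], float)

ccc = lins_ccc(ref, rater)
bias, lo, hi = bland_altman_loa(rater, ref)
print(f"CCC = {ccc:.3f}, bias = {bias:+.1f} deg, LoA = [{lo:.1f}, {hi:.1f}] deg")
```

A CCC of 1 indicates perfect concordance, 0 no concordance, and negative values (as reported for the thoracolumbar curves) indicate systematic inverse agreement; the LoA interval shows the range within which about 95% of individual measurement differences fall.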