
Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study.

Authors

Aydin Cemre, Duygu Ozden Bedre, Karakas Asli Beril, Er Eda, Gokmen Gokhan, Ozturk Anil Murat, Govsa Figen

Affiliations

Department of Orthopedics and Traumatology, Faculty of Medicine, Ege University, 35040 Izmir, Turkey.

Department of Anatomy, Faculty of Medicine, Bakırcay University, 35660 Izmir, Turkey.

Publication information

Medicina (Kaunas). 2025 Jul 25;61(8):1342. doi: 10.3390/medicina61081342.

DOI:10.3390/medicina61081342
PMID:40870387
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12387722/
Abstract

General-purpose multimodal large language models (LLMs) are increasingly used for medical image interpretation despite lacking clinical validation. This study evaluates the diagnostic reliability of ChatGPT-4o and Claude 2 in photographic assessment of adolescent idiopathic scoliosis (AIS) against radiological standards. This study examines two critical questions: whether families can derive reliable preliminary assessments from LLMs through analysis of clinical photographs and whether LLMs exhibit cognitive fidelity in their visuospatial reasoning capabilities for AIS assessment.

A prospective diagnostic accuracy study (STARD-compliant) analyzed 97 adolescents (74 with AIS and 23 with postural asymmetry). Standardized clinical photographs (nine views/patient) were assessed by two LLMs and two orthopedic residents against reference radiological measurements. Primary outcomes included diagnostic accuracy (sensitivity/specificity), Cobb angle concordance (Lin's CCC), inter-rater reliability (Cohen's κ), and measurement agreement (Bland-Altman LoA).

The LLMs exhibited hazardous diagnostic inaccuracy: ChatGPT misclassified all non-AIS cases (specificity 0% [95% CI: 0.0-14.8]), while Claude 2 generated 78.3% false positives. Systematic measurement errors exceeded clinical tolerance: ChatGPT overestimated thoracic curves by +10.74° (LoA: -21.45° to +42.92°), exceeding tolerance by >800%. Both LLMs showed inverse biomechanical concordance in thoracolumbar curves (CCC ≤ -0.106). Inter-rater reliability fell below random chance (ChatGPT κ = -0.039). Universal proportional bias (slopes ≈ -1.0) caused severe curve underestimation (e.g., 10-15° error for 50° deformities). Human evaluators demonstrated superior bias control (0.3-2.8° vs. 2.6-10.7°) but suboptimal specificity (21.7-26.1%) and hazardous lumbar concordance (CCC: -0.123).

General-purpose LLMs demonstrate clinically unacceptable inaccuracy in photographic AIS assessment, contraindicating clinical deployment. Catastrophic false positives, systematic measurement errors exceeding tolerance by 480-1074%, and inverse diagnostic concordance necessitate urgent regulatory safeguards under frameworks like the EU AI Act. Neither LLMs nor photographic human assessment achieve reliability thresholds for standalone screening, mandating domain-specific algorithm development and integration of 3D modalities.
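
The four outcome measures named in the abstract can be illustrated with a short sketch. This is not the authors' analysis code. The function names are hypothetical and the formulas are the standard textbook definitions. The abstract does not say how the confidence interval was computed, so the exact (Clopper-Pearson) method used here is an assumption, chosen because it yields an upper bound close to the reported 14.8% for 0 of 23.

```python
import math
from statistics import mean, stdev


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided CI for the proportion k/n (assumed method, see lead-in)."""

    def solve(k_cut, target):
        # binom_cdf(k_cut, n, p) decreases as p grows, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if binom_cdf(k_cut, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper


def specificity(tn, fp):
    """Share of truly negative (non-AIS) cases that were labelled negative."""
    return tn / (tn + fp)


def bland_altman(pred, ref):
    """Mean difference (bias) and 95% limits of agreement, bias +/- 1.96 SD."""
    diffs = [p - r for p, r in zip(pred, ref)]
    bias, sd = mean(diffs), stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient (population moments)."""
    n, mx, my = len(x), mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


def cohens_kappa(a, b):
    """Cohen's kappa for two raters' categorical labels."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    return math.nan if pe == 1 else (po - pe) / (1 - pe)


if __name__ == "__main__":
    # ChatGPT classified 0 of the 23 non-AIS cases correctly (from the abstract).
    print(clopper_pearson(0, 23))  # upper bound is about 0.148
    # 78.3% false positives among 23 non-AIS cases is read here as 18 of 23.
    print(round(specificity(tn=5, fp=18), 3))
```

Two consistency checks follow from the reported numbers. The limits of agreement for ChatGPT's thoracic estimates (-21.45° to +42.92°) are symmetric about the +10.74° bias, which fits the bias ± 1.96 SD form with an SD of roughly 16.4°. A negative CCC or κ, as reported, means the ratings agree with the reference less than chance would predict. For example, `lins_ccc([1, 2, 3], [3, 2, 1])` gives -1.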


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/3484578327fc/medicina-61-01342-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/5a89dfdacc9a/medicina-61-01342-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/cb1c33559a6b/medicina-61-01342-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/d5c890b05f57/medicina-61-01342-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/a16ad7ac6e36/medicina-61-01342-g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/132a2cd907c2/medicina-61-01342-g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/3867a27775fe/medicina-61-01342-g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/41e5fb45ec4d/medicina-61-01342-g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/d265d3c16da0/medicina-61-01342-g009.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/331160689fc7/medicina-61-01342-g010.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/65e55a5210a3/medicina-61-01342-g011.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2f0e/12387722/244597743106/medicina-61-01342-g012.jpg

