Horiuchi Daisuke, Tatekawa Hiroyuki, Oura Tatsushi, Shimono Taro, Walston Shannon L, Takita Hirotaka, Matsushita Shu, Mitsuyama Yasuhito, Miki Yukio, Ueda Daiju
Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
Eur Radiol. 2025 Jan;35(1):506-516. doi: 10.1007/s00330-024-10902-5. Epub 2024 Jul 12.
To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V)-based ChatGPT, and radiologists in musculoskeletal radiology.
We included 106 "Test Yourself" cases published in Skeletal Radiology between January 2014 and September 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT and the medical history and images into GPT-4V-based ChatGPT; each model then generated a diagnosis for every case. Two radiologists (a radiology resident and a board-certified radiologist) independently provided diagnoses for all cases. Diagnostic accuracy rates were determined against the published ground truth, and chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and the radiologists.
GPT-4-based ChatGPT significantly outperformed GPT-4V-based ChatGPT (p < 0.001), with accuracy rates of 43% (46/106) and 8% (9/106), respectively. The radiology resident and the board-certified radiologist achieved accuracy rates of 41% (43/106) and 53% (56/106), respectively. The diagnostic accuracy of GPT-4-based ChatGPT was comparable to that of the radiology resident and lower than that of the board-certified radiologist, although neither difference was significant (p = 0.78 and p = 0.22, respectively). The diagnostic accuracy of GPT-4V-based ChatGPT was significantly lower than that of both radiologists (both p < 0.001).
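As a consistency check, the reported p-values can be reproduced from the published counts using standard 2x2 chi-square tests on correct/incorrect contingency tables. The sketch below is an illustrative assumption, not the authors' analysis code; scipy's chi2_contingency applies Yates' continuity correction to 2x2 tables by default, which matches the reported values.

```python
# A minimal sketch, assuming standard 2x2 chi-square tests on
# correct/incorrect counts; this is NOT the authors' published code.
from scipy.stats import chi2_contingency

TOTAL = 106  # number of "Test Yourself" cases
CORRECT = {  # correct diagnoses per reader, from the reported results
    "GPT-4": 46,
    "GPT-4V": 9,
    "resident": 43,
    "board-certified": 56,
}

def compare(a: str, b: str) -> None:
    """Chi-square test on a 2x2 correct/incorrect contingency table.

    scipy applies Yates' continuity correction to 2x2 tables by default,
    which reproduces the p-values reported in the abstract.
    """
    table = [
        [CORRECT[a], TOTAL - CORRECT[a]],
        [CORRECT[b], TOTAL - CORRECT[b]],
    ]
    chi2, p, _, _ = chi2_contingency(table)
    print(f"{a} vs {b}: chi2 = {chi2:.2f}, p = {p:.3g}")

compare("GPT-4", "GPT-4V")            # p < 0.001
compare("GPT-4", "resident")          # p = 0.78
compare("GPT-4", "board-certified")   # p = 0.22
compare("GPT-4V", "resident")         # p < 0.001
compare("GPT-4V", "board-certified")  # p < 0.001
```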
GPT-4-based ChatGPT demonstrated significantly higher diagnostic accuracy than GPT-4V-based ChatGPT. While the diagnostic performance of GPT-4-based ChatGPT was comparable to that of radiology residents, it did not reach the level of board-certified radiologists in musculoskeletal radiology.
GPT-4-based ChatGPT outperformed GPT-4V-based ChatGPT and performed comparably to radiology residents, but it did not reach the level of board-certified radiologists in musculoskeletal radiology. Radiologists should understand ChatGPT's current performance as a diagnostic tool in order to use it optimally.
This study compared the diagnostic performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and radiologists in musculoskeletal radiology. GPT-4-based ChatGPT was comparable to radiology residents but did not reach the level of board-certified radiologists. When using ChatGPT, it is crucial to input appropriate descriptions of the imaging findings rather than the images themselves.