From the Department of Radiology and Imaging Sciences, Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, NIH Clinical Center, 10 Center Dr, Bldg 10, Rm 1C224D, Bethesda, MD 20892-1182 (P.M., B.H., A.S., Y.Z., R.M.S.); Walter Reed National Military Medical Center, Bethesda, Md (C.P., N.L., O.S.); Radiologic Associates of Middletown, Middletown, Conn (R.J., K.S.); and Baltimore VA Medical Center, Baltimore, Md (K.C.W.).
Radiology. 2024 Oct;313(1):e240609. doi: 10.1148/radiol.240609.
Background GPT-4V (GPT-4 with vision, ChatGPT; OpenAI) has shown impressive performance in several medical assessments. However, few studies have assessed its performance in interpreting radiologic images.

Purpose To assess and compare the accuracy of GPT-4V in interpreting radiologic cases presented with both images and textual context with that of radiologists and residents, to assess whether GPT-4V assistance improves human accuracy, and to assess and compare the accuracy of GPT-4V with image-only or text-only inputs.

Materials and Methods Seventy-two Case of the Day questions from the RSNA 2023 Annual Meeting were curated for this observer study. Answers from GPT-4V were obtained between November 26 and December 10, 2023, with the following inputs for each question: image only, text only, and both text and images. Five radiologists and three residents also answered the questions in an "open book" setting. For the artificial intelligence (AI)-assisted portion, the radiologists and residents were provided with the outputs of GPT-4V. The accuracy of radiologists and residents, both with and without AI assistance, was analyzed using a mixed-effects linear model. The accuracies of GPT-4V with different input combinations were compared using the McNemar test. P < .05 was considered to indicate a significant difference.

Results The accuracy of GPT-4V was 43% (31 of 72; 95% CI: 32, 55). Radiologists and residents did not significantly outperform GPT-4V in either imaging-dependent (59% and 56% vs 39%; P = .31 and P = .52, respectively) or imaging-independent (76% and 63% vs 70%; both P = .99) cases. With access to the GPT-4V responses, there was no evidence of improvement in the average accuracy of the readers. The accuracy of GPT-4V with text-only and image-only inputs was 50% (35 of 70; 95% CI: 39, 61) and 38% (26 of 69; 95% CI: 27, 49), respectively.

Conclusion The radiologists and residents did not significantly outperform GPT-4V. Assistance from GPT-4V did not help human raters. GPT-4V relied on the textual context for its outputs.

© RSNA, 2024. See also the editorial by Katz in this issue.
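As a point of reference for the paired comparison described in Materials and Methods, the following is a minimal, hypothetical Python sketch of a McNemar test on per-question correctness under two input conditions. The outcome arrays, counts, and random seed are illustrative placeholders, not the study's data or analysis code.

```python
# Illustrative sketch only: the study's actual analysis code is not reproduced here.
# Assumes two hypothetical boolean arrays marking, per question, whether GPT-4V
# answered correctly with text-only vs image-only input.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 69  # placeholder number of questions answered under both conditions
correct_text_only = rng.random(n_questions) < 0.50   # placeholder outcomes
correct_image_only = rng.random(n_questions) < 0.38  # placeholder outcomes

# Build the 2x2 paired contingency table:
# rows = text-only correct/incorrect, columns = image-only correct/incorrect.
table = np.array([
    [np.sum(correct_text_only & correct_image_only),
     np.sum(correct_text_only & ~correct_image_only)],
    [np.sum(~correct_text_only & correct_image_only),
     np.sum(~correct_text_only & ~correct_image_only)],
])

# Exact McNemar test on the discordant pairs; P < .05 taken as significant,
# mirroring the threshold stated in the abstract.
result = mcnemar(table, exact=True)
print(f"McNemar statistic = {result.statistic:.0f}, P = {result.pvalue:.3f}")
```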