From the Department of Radiology and Imaging Sciences, Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, NIH Clinical Center, 10 Center Dr, Bldg 10, Rm 1C224D, Bethesda, MD 20892-1182 (P.M., B.H., A.S., Y.Z., R.M.S.); Walter Reed National Military Medical Center, Bethesda, Md (C.P., N.L., O.S.); Radiologic Associates of Middletown, Middletown, Conn (R.J., K.S.); and Baltimore VA Medical Center, Baltimore, Md (K.C.W.).
Radiology. 2024 Oct;313(1):e240609. doi: 10.1148/radiol.240609.
Background GPT-4V (GPT-4 with vision, ChatGPT; OpenAI) has shown impressive performance in several medical assessments. However, few studies have assessed its performance in interpreting radiologic images.

Purpose To assess and compare the accuracy of GPT-4V in interpreting radiologic cases presented with both images and textual context with that of radiologists and residents, to assess whether GPT-4V assistance improves human accuracy, and to assess and compare the accuracy of GPT-4V with image-only or text-only inputs.

Materials and Methods Seventy-two Case of the Day questions from the RSNA 2023 Annual Meeting were curated for this observer study. Answers from GPT-4V were obtained between November 26 and December 10, 2023, with the following inputs for each question: image only, text only, and both text and images. Five radiologists and three residents also answered the questions in an "open book" setting. For the artificial intelligence (AI)-assisted portion, the radiologists and residents were provided with the outputs of GPT-4V. The accuracy of radiologists and residents, both with and without AI assistance, was analyzed using a mixed-effects linear model. The accuracies of GPT-4V with different input combinations were compared using the McNemar test. P < .05 was considered to indicate a significant difference.

Results The accuracy of GPT-4V was 43% (31 of 72; 95% CI: 32, 55). Radiologists and residents did not significantly outperform GPT-4V in either imaging-dependent (59% and 56% vs 39%; P = .31 and P = .52, respectively) or imaging-independent (76% and 63% vs 70%; both P = .99) cases. With access to the GPT-4V responses, there was no evidence of improvement in the average accuracy of the readers. The accuracy of GPT-4V with text-only and image-only inputs was 50% (35 of 70; 95% CI: 39, 61) and 38% (26 of 69; 95% CI: 27, 49), respectively.

Conclusion The radiologists and residents did not significantly outperform GPT-4V. Assistance from GPT-4V did not help human raters. GPT-4V relied on the textual context for its outputs.

© RSNA, 2024. See also the editorial by Katz in this issue.
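As a point of reference for the paired comparison described in Materials and Methods, the following is a minimal, hypothetical Python sketch of a McNemar test on per-question correctness under two input conditions. The outcome arrays, counts, and random seed are illustrative placeholders, not the study's data or analysis code.

```python
# Illustrative sketch only: the study's actual analysis code is not reproduced here.
# Assumes two hypothetical boolean arrays marking, per question, whether GPT-4V
# answered correctly with text-only vs image-only input.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n_questions = 69  # placeholder number of questions answered under both conditions
correct_text_only = rng.random(n_questions) < 0.50   # placeholder outcomes
correct_image_only = rng.random(n_questions) < 0.38  # placeholder outcomes

# Build the 2x2 paired contingency table:
# rows = text-only correct/incorrect, columns = image-only correct/incorrect.
table = np.array([
    [np.sum(correct_text_only & correct_image_only),
     np.sum(correct_text_only & ~correct_image_only)],
    [np.sum(~correct_text_only & correct_image_only),
     np.sum(~correct_text_only & ~correct_image_only)],
])

# Exact McNemar test on the discordant pairs; P < .05 taken as significant,
# mirroring the threshold stated in the abstract.
result = mcnemar(table, exact=True)
print(f"McNemar statistic = {result.statistic:.0f}, P = {result.pvalue:.3f}")
```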