Sun Shawn H, Chen Kasha, Anavim Samuel, Phillipi Michael, Yeh Leslie, Huynh Kenneth, Cortes Gillean, Tran Julia, Tran Mark, Yaghmai Vahid, Houshyar Roozbeh
University of California Irvine, Radiology Department, UCI Medical Center, Orange, California, USA.
Acad Radiol. 2025 May;32(5):3096-3102. doi: 10.1016/j.acra.2024.11.028. Epub 2024 Dec 4.
The expansion of large language models to process images offers new avenues for application in radiology. This study assesses the multimodal capabilities of contemporary large language models, which can analyze image inputs in addition to textual data, on radiology board-style examination questions that include images.
A total of 280 questions were retrospectively selected from the AuntMinnie public test bank. Each question was converted into three prompt formats: (1) multimodal, (2) image-only, and (3) text-only input. Three models, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet, were evaluated using these prompts. The Cochran Q test and pairwise McNemar test were used to compare performance between prompt formats and between models.
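As an illustration of the three prompt formats, the sketch below shows how a single board-style question might be submitted to a vision-capable model through the OpenAI Python SDK. This is a minimal, hypothetical reconstruction: the abstract does not describe the study's actual prompting pipeline, and the `query` helper, model identifier, and image encoding are all assumptions.

```python
# Hypothetical sketch of the three prompt formats (multimodal, image-only,
# text-only); the study's actual pipeline, model identifiers, and image
# handling are not described in the abstract.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path: str) -> str:
    """Base64-encode a local image for inline submission."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_content(question_text, image_path):
    """Assemble message content for text-only, image-only, or multimodal input."""
    content = []
    if question_text is not None:
        content.append({"type": "text", "text": question_text})
    if image_path is not None:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"},
        })
    return content

def query(question_text=None, image_path=None, model="gpt-4-turbo"):
    """Submit one board-style question and return the model's answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_content(question_text, image_path)}],
    )
    return response.choices[0].message.content

# The three formats evaluated in the study:
#   multimodal -> query(question_text=q, image_path=img)
#   image-only -> query(image_path=img)
#   text-only  -> query(question_text=q)
```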
No difference in the percentage of correct answers was found among the text, image, and multimodal prompt formats for GPT-4V (54%, 52%, and 57%, respectively; p = .31) or Gemini 1.5 Pro (53%, 54%, and 57%, respectively; p = .53). For Claude 3.5 Sonnet, image-only input (48%) significantly underperformed both text-only input (63%, p < .001) and multimodal input (66%, p < .001), with no difference between the text and multimodal inputs (p = .29). Claude 3.5 Sonnet significantly outperformed GPT-4V and Gemini 1.5 Pro in the text and multimodal formats (p < .01).
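The comparisons above can be reproduced on any per-question score matrix. The sketch below, assuming correctness is recorded as 0/1 for each of the 280 questions under each prompt format, shows how Cochran's Q (all three formats at once) and the pairwise McNemar test (two formats on the same questions) might be run; the data here are random placeholders, not the study's results.

```python
# Illustrative use of the Cochran Q and pairwise McNemar tests on binary
# per-question scores (1 = correct, 0 = incorrect); placeholder data only.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
n_questions = 280

# One column per prompt format: text, image, multimodal.
scores = rng.integers(0, 2, size=(n_questions, 3))

# Cochran's Q: do the three formats differ in proportion correct?
q_result = cochrans_q(scores)
print(f"Cochran Q = {q_result.statistic:.2f}, p = {q_result.pvalue:.3f}")

# Pairwise McNemar on paired outcomes, e.g. text (col 0) vs. multimodal (col 2).
text, multi = scores[:, 0], scores[:, 2]
table = np.array([
    [np.sum((text == 1) & (multi == 1)), np.sum((text == 1) & (multi == 0))],
    [np.sum((text == 0) & (multi == 1)), np.sum((text == 0) & (multi == 0))],
])
m_result = mcnemar(table, exact=True)
print(f"McNemar p = {m_result.pvalue:.3f}")
```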
Vision-capable large language models cannot effectively use images to increase performance on radiology board-style examination questions. When using textual data alone, Claude 3.5 Sonnet outperforms GPT-4V and Gemini 1.5 Pro, highlighting advancements in the field and the model's potential for use in further research.