RATIONALE AND OBJECTIVES

The expansion of large language models to process images offers new avenues for application in radiology. This study aims to assess the multimodal capabilities of contemporary large language models, which allow analysis of image inputs in addition to textual data, on radiology board-style examination questions with images.

MATERIALS AND METHODS

A total of 280 questions were retrospectively selected from the AuntMinnie public test bank. The test questions were converted into three prompt formats: (1) multimodal, (2) image-only, and (3) text-only input. Three models, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet, were evaluated using these prompts. The Cochran Q test and pairwise McNemar tests were used to compare performance between prompt formats and between models.
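
The two tests named above can be sketched in plain Python. This is a minimal illustration on synthetic data, not the study's analysis code: the per-question outcomes are randomly generated, the accuracy values are arbitrary, the function names cochran_q and mcnemar_exact are hypothetical, and the exact (binomial) form of the McNemar test is an assumption, since the abstract does not say which variant was used.

```python
import math
import random
from itertools import combinations


def cochran_q(rows):
    """Cochran Q test for three paired binary conditions.

    rows: one tuple of 0/1 outcomes per question, e.g. (text, image, multimodal).
    Returns (Q, p). Limited to k = 3 so the chi-square tail with 2 degrees
    of freedom has the closed form exp(-Q / 2).
    """
    k = len(rows[0])
    if k != 3:
        raise ValueError("this sketch only handles three conditions")
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    n_correct = sum(row_totals)
    denom = k * n_correct - sum(t * t for t in row_totals)
    if denom == 0:
        # Every question was answered identically under all conditions.
        return 0.0, 1.0
    q = (k - 1) * (k * sum(c * c for c in col_totals) - n_correct ** 2) / denom
    return q, math.exp(-q / 2)


def mcnemar_exact(a, b):
    """Two-sided exact McNemar test on paired 0/1 outcome lists a and b."""
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n = n01 + n10
    if n == 0:
        return 1.0
    # Binomial tail over the discordant pairs under H0: p = 0.5.
    tail = sum(math.comb(n, i) for i in range(min(n01, n10) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Synthetic outcomes for 280 questions; the accuracies are arbitrary.
    rng = random.Random(0)
    formats = ["text", "image", "multimodal"]
    assumed_accuracy = [0.60, 0.50, 0.65]
    outcomes = [
        tuple(int(rng.random() < p) for p in assumed_accuracy)
        for _ in range(280)
    ]

    q, p = cochran_q(outcomes)
    print(f"Cochran Q = {q:.2f}, p = {p:.4f}")

    # Pairwise follow-up comparisons between prompt formats.
    for i, j in combinations(range(len(formats)), 2):
        p_pair = mcnemar_exact([r[i] for r in outcomes], [r[j] for r in outcomes])
        print(f"{formats[i]} vs {formats[j]}: McNemar p = {p_pair:.4f}")
```

Both tests operate on paired outcomes, which fits a design in which every model and prompt format answers the same set of questions. Only the discordant pairs (questions answered correctly under one condition but not the other) contribute to the McNemar p-value.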

RESULTS

No significant difference in the percentage of correct answers was found among the text, image, and multimodal prompt formats for GPT-4V (54%, 52%, and 57%, respectively; p = .31) or Gemini 1.5 Pro (53%, 54%, and 57%, respectively; p = .53). For Claude 3.5 Sonnet, the image input (48%) significantly underperformed compared to the text input (63%, p < .001) and the multimodal input (66%, p < .001), but no significant difference was found between the text and multimodal inputs (p = .29). Claude 3.5 Sonnet significantly outperformed GPT-4V and Gemini 1.5 Pro in the text and multimodal formats (p < .01).

CONCLUSION

Vision-capable large language models cannot effectively use images to increase performance on radiology board-style examination questions. When using textual data alone, Claude 3.5 Sonnet outperforms GPT-4V and Gemini 1.5 Pro, highlighting the advancements in the field and its potential for use in further research.

Large Language Models with Vision on Diagnostic Radiology Board Exam Style Questions.
