
Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations.

Author information

Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3, Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan.

Department of Nuclear Medicine, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.

Publication information

Jpn J Radiol. 2024 Dec;42(12):1392-1398. doi: 10.1007/s11604-024-01633-0. Epub 2024 Jul 20.

Abstract

PURPOSE

The performance of vision-language models (VLMs) with image interpretation capabilities, such as GPT-4 omni (GPT-4o), GPT-4 vision (GPT-4V), and Claude-3, has not been compared and remains unexplored in specialized radiological fields, including nuclear medicine and interventional radiology. This study aimed to evaluate and compare the diagnostic accuracy of various VLMs, including GPT-4 + GPT-4V, GPT-4o, Claude-3 Sonnet, and Claude-3 Opus, using Japanese diagnostic radiology, nuclear medicine, and interventional radiology (JDR, JNM, and JIR, respectively) board certification tests.

MATERIALS AND METHODS

In total, 383 questions from the JDR test (358 images), 300 from the JNM test (92 images), and 322 from the JIR test (96 images) from 2019 to 2023 were consecutively collected. The accuracy rates of the GPT-4 + GPT-4V, GPT-4o, Claude-3 Sonnet, and Claude-3 Opus were calculated for all questions or questions with images. The accuracy rates of the VLMs were compared using McNemar's test.
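The paired comparison described above (each model answers the same questions, and McNemar's test is applied to the discordant answers) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the toy data, and the use of the chi-square approximation with continuity correction are all assumptions for the example.

```python
import math

def mcnemar(correct_a, correct_b):
    """McNemar's chi-square test (continuity-corrected) for paired binary
    outcomes, e.g. per-question correctness of two models on the same test.
    Returns (statistic, p_value)."""
    # Count discordant pairs: b = A correct / B wrong, c = A wrong / B correct.
    b = sum(1 for a, bb in zip(correct_a, correct_b) if a and not bb)
    c = sum(1 for a, bb in zip(correct_a, correct_b) if not a and bb)
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(X >= stat) = erfc(sqrt(stat/2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Toy data: 1 = answered correctly, 0 = answered incorrectly.
model_a = [1] * 10 + [0] * 2 + [1] * 5 + [0] * 3
model_b = [0] * 10 + [1] * 2 + [1] * 5 + [0] * 3
stat, p = mcnemar(model_a, model_b)
```

Only the discordant pairs (questions where exactly one model is correct) carry information about a difference in accuracy; concordant answers cancel out, which is why McNemar's test suits paired designs like this one.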

RESULTS

GPT-4o demonstrated the highest accuracy rates across all evaluations with the JDR (all questions, 49%; questions with images, 48%), JNM (all questions, 64%; questions with images, 59%), and JIR tests (all questions, 43%; questions with images, 34%), followed by Claude-3 Opus with the JDR (all questions, 40%; questions with images, 38%), JNM (all questions, 42%; questions with images, 43%), and JIR tests (all questions, 40%; questions with images, 30%). For all questions, McNemar's test showed that GPT-4o significantly outperformed the other VLMs (all P < 0.007), except for Claude-3 Opus in the JIR test. For questions with images, GPT-4o outperformed the other VLMs in the JDR and JNM tests (all P < 0.001), except Claude-3 Opus in the JNM test.

CONCLUSION

GPT-4o had the highest success rates, both for all questions and for questions with images, on the JDR, JNM, and JIR board certification tests.


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4dad/11588758/4947f6997dfd/11604_2024_1633_Fig1_HTML.jpg
