Performance of Multimodal Large Language Models in Japanese Diagnostic Radiology Board Examinations (2021-2023).

Authors

Nakaura Takeshi, Yoshida Naofumi, Kobayashi Naoki, Nagayama Yasunori, Uetani Hiroyuki, Kidoh Masafumi, Oda Seitaro, Funama Yoshinori, Hirai Toshinori

Affiliation

Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, Honjo 1-1-1, Kumamoto 860-8556, Japan (T.N., N.Y., N.K., Y.N., H.U., M.K., S.O., T.H.).

Publication information

Acad Radiol. 2025 May;32(5):2394-2401. doi: 10.1016/j.acra.2024.10.035. Epub 2024 Nov 8.

DOI:10.1016/j.acra.2024.10.035
PMID:39521632
Abstract

RATIONALE AND OBJECTIVES

To evaluate the performance of various multimodal large language models (LLMs) in the Japanese Diagnostic Radiology Board Examinations (JDRBE) both with and without images.

MATERIALS AND METHODS

Five multimodal LLMs (GPT-4o, Claude 3 Opus, GPT-4 Vision, Gemini Flash 1.5, and Gemini Pro 1.5) were tested using questions from the JDRBE from 2021 to 2023. The models' performances were assessed in two conditions: with images and without images. Accuracy rates were calculated for each model, both overall and within specific subspecialties, including Abdominal and Pelvic Radiology, Musculoskeletal and Breast Imaging, Neuroradiology and Head and Neck Imaging, Nuclear Medicine, and Thoracic and Cardiac Radiology.
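The accuracy calculation described above can be sketched in a few lines of Python. This is only an illustration and not the authors' code: the record layout, the `accuracy_by` helper, the model names, and every row of toy data are hypothetical and were made up for demonstration.

```python
from collections import defaultdict

# Hypothetical graded answers: (model, condition, subspecialty, is_correct).
# These rows are made-up illustration data, not results from the study.
records = [
    ("model_a", "with_images", "Nuclear Medicine", True),
    ("model_a", "with_images", "Neuroradiology", False),
    ("model_a", "without_images", "Nuclear Medicine", True),
    ("model_a", "without_images", "Neuroradiology", True),
    ("model_b", "with_images", "Nuclear Medicine", False),
    ("model_b", "with_images", "Neuroradiology", False),
    ("model_b", "without_images", "Nuclear Medicine", True),
    ("model_b", "without_images", "Neuroradiology", False),
]


def accuracy_by(rows, key_indices):
    """Return {group key: percent correct} for the chosen grouping columns."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_indices)
        total[key] += 1
        correct[key] += int(row[3])  # True counts as 1, False as 0
    return {key: 100.0 * correct[key] / total[key] for key in total}


overall = accuracy_by(records, (0,))              # per model
per_condition = accuracy_by(records, (0, 1))      # per model and image condition
per_subspecialty = accuracy_by(records, (0, 2))   # per model and subspecialty
```

Grouping by different column subsets reuses one helper for the overall, per-condition, and per-subspecialty rates.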

RESULTS

The average accuracy rates of the LLMs ranged from 30.21% to 45.00%, with GPT-4o achieving the highest (45.00%). Claude 3 Opus performed best without images (45.83%), while the addition of images did not significantly improve accuracy for any model. Performance varied across subspecialties, with GPT-4o excelling in "Other" (65.63%) and Claude 3 Opus in Neuroradiology and Head and Neck Imaging (55.56%). Importantly, none of the models surpassed the passing threshold of 60%.
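The abstract does not say which statistical test was used to compare the two conditions. One common choice for paired correct/incorrect outcomes on the same questions is McNemar's exact test, sketched below as an assumption rather than as the study's method. The function name `mcnemar_exact` and the toy answer lists are hypothetical.

```python
from math import comb


def mcnemar_exact(with_images, without_images):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes."""
    pairs = list(zip(with_images, without_images))
    # b: questions answered correctly only with images.
    b = sum(1 for w, wo in pairs if w and not wo)
    # c: questions answered correctly only without images.
    c = sum(1 for w, wo in pairs if wo and not w)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-question outcomes for one hypothetical model (not study data).
toy_with = [True, False, True, False, False, True]
toy_without = [True, True, True, False, True, False]
p_value = mcnemar_exact(toy_with, toy_without)
```

Only the discordant questions (correct in one condition but not the other) influence the result, which is why the test suits a with-versus-without comparison on an identical question set.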

CONCLUSION

Our findings demonstrate that multimodal LLMs exhibit a range of accuracy in JDRBE, with GPT-4o and Claude 3 Opus showing the highest overall performance. However, the addition of images did not significantly improve accuracy for any model.

SUMMARY

Multimodal LLMs are a very promising tool in the field of radiology. However, our study shows that while there are some promising results, their ability to evaluate radiological medical images is currently limited. Further development seems necessary before they can be used routinely.

KEY POINTS

Multimodal LLMs show varying accuracy (30.21-45.83%) on Japanese diagnostic radiology board examinations. Adding images did not significantly improve multimodal LLM performance, and significantly decreased accuracy for one model. Performances of multimodal LLMs varied considerably across radiology subspecialties.

Similar articles

1
Performance of Multimodal Large Language Models in Japanese Diagnostic Radiology Board Examinations (2021-2023).
Acad Radiol. 2025 May;32(5):2394-2401. doi: 10.1016/j.acra.2024.10.035. Epub 2024 Nov 8.
2
Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations.
Jpn J Radiol. 2024 Dec;42(12):1392-1398. doi: 10.1007/s11604-024-01633-0. Epub 2024 Jul 20.
3
Role of visual information in multimodal large language model performance: an evaluation using the Japanese nuclear medicine board examination.
Ann Nucl Med. 2025 Feb;39(2):217-224. doi: 10.1007/s12149-024-01992-8. Epub 2024 Nov 13.
4
Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5 edition.
Diagn Interv Radiol. 2025 Mar 3;31(2):111-129. doi: 10.4274/dir.2024.242876. Epub 2024 Sep 9.
5
Diagnostic performances of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro in "Diagnosis Please" cases.
Jpn J Radiol. 2024 Nov;42(11):1231-1235. doi: 10.1007/s11604-024-01619-y. Epub 2024 Jul 1.
6
Accuracy and quality of ChatGPT-4o and Google Gemini performance on image-based neurosurgery board questions.
Neurosurg Rev. 2025 Mar 25;48(1):320. doi: 10.1007/s10143-025-03472-7.
7
Evaluating the Effectiveness of advanced large language models in medical Knowledge: A Comparative study using Japanese national medical examination.
Int J Med Inform. 2025 Jan;193:105673. doi: 10.1016/j.ijmedinf.2024.105673. Epub 2024 Oct 28.
8
GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination.
Jpn J Radiol. 2024 Aug;42(8):918-926. doi: 10.1007/s11604-024-01561-z. Epub 2024 May 11.
9
Evaluating the reference accuracy of large language models in radiology: a comparative study across subspecialties.
Diagn Interv Radiol. 2025 May 12. doi: 10.4274/dir.2025.253101.
10
Large Language Models with Vision on Diagnostic Radiology Board Exam Style Questions.
Acad Radiol. 2025 May;32(5):3096-3102. doi: 10.1016/j.acra.2024.11.028. Epub 2024 Dec 4.

Cited by

1
Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination.
Jpn J Radiol. 2025 Sep 12. doi: 10.1007/s11604-025-01861-y.