Performance of Multimodal Large Language Models in Japanese Diagnostic Radiology Board Examinations (2021-2023).

Author information

Nakaura Takeshi, Yoshida Naofumi, Kobayashi Naoki, Nagayama Yasunori, Uetani Hiroyuki, Kidoh Masafumi, Oda Seitaro, Funama Yoshinori, Hirai Toshinori

Affiliations

Department of Diagnostic Radiology, Graduate School of Medical Sciences, Kumamoto University, Honjo 1-1-1, Kumamoto 860-8556, Japan (T.N., N.Y., N.K., Y.N., H.U., M.K., S.O., T.H.).

Publication information

Acad Radiol. 2025 May;32(5):2394-2401. doi: 10.1016/j.acra.2024.10.035. Epub 2024 Nov 8.

Abstract

RATIONALE AND OBJECTIVES

To evaluate the performance of various multimodal large language models (LLMs) in the Japanese Diagnostic Radiology Board Examinations (JDRBE) both with and without images.

MATERIALS AND METHODS

Five multimodal LLMs (GPT-4o, Claude 3 Opus, GPT-4 Vision, Gemini Flash 1.5, and Gemini Pro 1.5) were tested using questions from the JDRBE from 2021 to 2023. The models' performances were assessed in two conditions: with images and without images. Accuracy rates were calculated for each model, both overall and within specific subspecialties, including Abdominal and Pelvic Radiology, Musculoskeletal and Breast Imaging, Neuroradiology and Head and Neck Imaging, Nuclear Medicine, and Thoracic and Cardiac Radiology.
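
The accuracy calculation described above can be illustrated with a minimal sketch. The record layout, the function name `accuracy_by_group`, and the toy answers are hypothetical and are not taken from the study; treating "at or above 60%" as meeting the threshold is likewise an assumption.

```python
from collections import defaultdict

# Passing threshold reported in the abstract (60%).
PASSING_THRESHOLD = 0.60


def accuracy_by_group(records):
    """Return {group: (correct, total, accuracy)}, including an "Overall" entry.

    `records` is an iterable of (subspecialty, is_correct) pairs for one model
    under one condition (with or without images). This layout is hypothetical.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for subspecialty, is_correct in records:
        for key in (subspecialty, "Overall"):
            counts[key][0] += int(is_correct)
            counts[key][1] += 1
    return {k: (c, n, c / n) for k, (c, n) in counts.items()}


# Toy graded answers, invented for illustration only (not study data).
toy_records = [
    ("Neuroradiology and Head and Neck Imaging", True),
    ("Neuroradiology and Head and Neck Imaging", False),
    ("Nuclear Medicine", True),
    ("Nuclear Medicine", True),
    ("Thoracic and Cardiac Radiology", False),
]

results = accuracy_by_group(toy_records)
for group, (correct, total, acc) in results.items():
    # Assumption: an accuracy at or above the threshold counts as meeting it.
    flag = "meets threshold" if acc >= PASSING_THRESHOLD else "below threshold"
    print(f"{group}: {correct}/{total} = {acc:.2%} ({flag})")
```

Running the same function once per model and per condition would give the kind of overall and subspecialty-level table the study reports.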

RESULTS

The average accuracy rates of the LLMs ranged from 30.21% to 45.00%, with GPT-4o achieving the highest (45.00%). Claude 3 Opus performed best without images (45.83%), while the addition of images did not significantly improve accuracy for any model. Performance varied across subspecialties, with GPT-4o excelling in "Other" (65.63%) and Claude 3 Opus in Neuroradiology and Head and Neck Imaging (55.56%). Importantly, none of the models surpassed the passing threshold of 60%.
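
The abstract does not state which statistical test was used to compare the with-image and without-image conditions. As an illustration only, the sketch below uses an exact McNemar test, a common choice when the same questions are answered under two conditions. The function names and the toy answer vectors are hypothetical.

```python
from math import comb


def discordant_counts(without_images, with_images):
    """Count discordant pairs between two paired lists of booleans.

    Returns (b, c), where b = correct without images but wrong with images,
    and c = wrong without images but correct with images.
    """
    b = sum(1 for wo, wi in zip(without_images, with_images) if wo and not wi)
    c = sum(1 for wo, wi in zip(without_images, with_images) if not wo and wi)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) lower tail up to k, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy paired outcomes for one model, invented for illustration (not study data).
toy_without = [True, True, False, True, False, True, True, False]
toy_with = [True, False, False, True, True, False, True, False]

b, c = discordant_counts(toy_without, toy_with)
p_value = mcnemar_exact(b, c)
print(f"b={b}, c={c}, exact McNemar p={p_value:.3f}")
```

Only the questions on which the two conditions disagree affect the result, which is why a model can show similar accuracy with and without images and still yield a non-significant difference.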

CONCLUSION

Our findings demonstrate that multimodal LLMs exhibit a range of accuracy in JDRBE, with GPT-4o and Claude 3 Opus showing the highest overall performance. However, the addition of images did not significantly improve accuracy for any model.

SUMMARY

Multimodal LLMs are promising tools for radiology. However, our study shows that, despite some encouraging results, their ability to evaluate radiological images is currently limited. Further development appears necessary before they can be used routinely.

KEY POINTS

Multimodal LLMs show varying accuracy (30.21% to 45.83%) on the Japanese Diagnostic Radiology Board Examinations. Adding images did not significantly improve the performance of any multimodal LLM, and it significantly decreased accuracy for one model. The performance of multimodal LLMs varied considerably across radiology subspecialties.
