Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination.

Author Information

Hirano Yuichiro, Miki Soichiro, Yamagishi Yosuke, Hanaoka Shouhei, Nakao Takahiro, Kikuchi Tomohiro, Nakamura Yuta, Nomura Yukihiro, Yoshikawa Takeharu, Abe Osamu

Affiliations

Department of Radiology, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan.

Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan.

Publication Information

Jpn J Radiol. 2025 Sep 12. doi: 10.1007/s11604-025-01861-y.

Abstract

PURPOSE

To assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE).

MATERIALS AND METHODS

The dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established through consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those lacking unanimous agreement on the answer were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with image input (vision) and without (text-only). Performance differences between the two conditions were assessed using McNemar's exact test. Two diagnostic radiologists (with 2 and 18 years of experience) independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) on a five-point Likert scale, blinded to model identity. Legitimacy scores were analyzed using Friedman's test, followed by pairwise Wilcoxon signed-rank tests with Holm correction.
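As a concrete illustration of the statistical analysis described above, the following minimal Python sketch applies McNemar's exact test to paired correctness data (vision vs. text-only) and Friedman's test followed by Holm-corrected pairwise Wilcoxon signed-rank tests to Likert legitimacy scores. The simulated data, variable names, and choice of SciPy/statsmodels functions are illustrative assumptions, not the authors' actual analysis code.

```python
# Minimal sketch of the statistical pipeline, using simulated data in place of
# the study's per-question correctness flags and per-response Likert ratings.
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_questions = 233  # number of questions reported in the Results

# --- McNemar's exact test: vision vs. text-only for one model ---
# Hypothetical 0/1 vectors, one entry per question (1 = answered correctly).
correct_vision = rng.integers(0, 2, n_questions)
correct_text = rng.integers(0, 2, n_questions)

# Paired 2x2 contingency table: rows = vision correct/incorrect, columns = text-only correct/incorrect.
table = np.array([
    [np.sum((correct_vision == 1) & (correct_text == 1)),
     np.sum((correct_vision == 1) & (correct_text == 0))],
    [np.sum((correct_vision == 0) & (correct_text == 1)),
     np.sum((correct_vision == 0) & (correct_text == 0))],
])
print("McNemar exact p =", mcnemar(table, exact=True).pvalue)

# --- Friedman test + pairwise Wilcoxon with Holm correction for legitimacy scores ---
# ratings[model] is a hypothetical vector of five-point Likert scores, one per question.
models = ["GPT-4 Turbo", "Claude 3.7 Sonnet", "o3", "Gemini 2.5 Pro"]
ratings = {m: rng.integers(1, 6, n_questions) for m in models}

stat, p_friedman = friedmanchisquare(*[ratings[m] for m in models])
print("Friedman p =", p_friedman)

# Pairwise Wilcoxon signed-rank tests, then Holm correction across the six model pairs.
pairs = list(combinations(models, 2))
raw_p = [wilcoxon(ratings[a], ratings[b]).pvalue for a, b in pairs]
adj_p = multipletests(raw_p, method="holm")[1]
for (a, b), p in zip(pairs, adj_p):
    print(f"{a} vs {b}: Holm-adjusted p = {p:.3f}")
```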

RESULTS

The dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 again ranked highest, with an accuracy of 67%. The addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5) but not of the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters.

CONCLUSION

Recent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology. Of the eight multimodal large language models evaluated on the Japan Diagnostic Radiology Board Examination, OpenAI's o3 and Google DeepMind's Gemini 2.5 Pro achieved the highest accuracies (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.
