Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination.

Authors

Hirano Yuichiro, Miki Soichiro, Yamagishi Yosuke, Hanaoka Shouhei, Nakao Takahiro, Kikuchi Tomohiro, Nakamura Yuta, Nomura Yukihiro, Yoshikawa Takeharu, Abe Osamu

Affiliations

Department of Radiology, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan.

Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan.

Publication information

Jpn J Radiol. 2025 Sep 12. doi: 10.1007/s11604-025-01861-y.

DOI: 10.1007/s11604-025-01861-y
PMID: 40938561

Abstract

PURPOSE

To assess and compare the accuracy and legitimacy of multimodal large language models (LLMs) on the Japan Diagnostic Radiology Board Examination (JDRBE).

MATERIALS AND METHODS

The dataset comprised questions from JDRBE 2021, 2023, and 2024, with ground-truth answers established by consensus among multiple board-certified diagnostic radiologists. Questions without associated images and those on which the radiologists did not unanimously agree were excluded. Eight LLMs were evaluated: GPT-4 Turbo, GPT-4o, GPT-4.5, GPT-4.1, o3, o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each model was evaluated under two conditions: with image input (vision) and without (text-only). Performance differences between the two conditions were assessed using McNemar's exact test. Two diagnostic radiologists (with 2 and 18 years of experience), blinded to model identity, independently rated the legitimacy of responses from four models (GPT-4 Turbo, Claude 3.7 Sonnet, o3, and Gemini 2.5 Pro) on a five-point Likert scale. Legitimacy scores were analyzed using Friedman's test, followed by pairwise Wilcoxon signed-rank tests with Holm correction.
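
As an illustration of the paired vision versus text-only comparison described above, the following is a minimal sketch of a two-sided exact McNemar test in plain Python. It is not the authors' code: the function names and the toy correctness vectors are hypothetical, and the abstract does not report the discordant-pair counts.

```python
from math import comb


def discordant_counts(vision_correct, text_correct):
    """Count questions answered correctly under only one condition.

    Both arguments are equal-length sequences of booleans, one entry
    per question (hypothetical input format).
    """
    pairs = list(zip(vision_correct, text_correct))
    b = sum(v and not t for v, t in pairs)  # correct only with images
    c = sum(t and not v for v, t in pairs)  # correct only without images
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis the discordant pairs split as
    Binomial(b + c, 0.5), so the p-value is twice the smaller tail,
    capped at 1.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    vision = [True, True, True, False, True, False, True, True]
    text = [True, False, False, False, True, False, False, True]
    b, c = discordant_counts(vision, text)
    print("discordant pairs:", b, c)
    print("exact McNemar p-value:", mcnemar_exact(b, c))
```

Only the questions on which the two conditions disagree enter the test, which is why a model can gain a few percentage points with images and still show no significant difference.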

RESULTS

The dataset included 233 questions. Under the vision condition, o3 achieved the highest accuracy at 72%, followed by o4-mini (70%) and Gemini 2.5 Pro (70%). Under the text-only condition, o3 topped the list with an accuracy of 67%. Addition of image input significantly improved the accuracy of two models (Gemini 2.5 Pro and GPT-4.5), but not the others. Both o3 and Gemini 2.5 Pro received significantly higher legitimacy scores than GPT-4 Turbo and Claude 3.7 Sonnet from both raters.
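
The legitimacy result rests on pairwise comparisons among the four rated models, which gives six model pairs per rater. A minimal sketch of the Holm step-down adjustment applied to such a family of tests is shown here; the raw p-values are hypothetical (the abstract reports none), and `holm_adjust` is an illustrative helper rather than the authors' implementation.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order.

    The smallest raw p-value is multiplied by m, the next by m - 1,
    and so on; a running maximum keeps the adjusted values monotone,
    and each value is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values for the six pairs among four models.
    raw = [0.001, 0.002, 0.004, 0.03, 0.20, 0.65]
    for p, q in zip(raw, holm_adjust(raw)):
        print(f"raw={p:.3f}  holm={q:.3f}")
```

A pair is declared significantly different when its adjusted p-value falls below the chosen threshold (commonly 0.05).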

CONCLUSION

Recent multimodal LLMs, particularly o3 and Gemini 2.5 Pro, have demonstrated remarkable progress on JDRBE questions, reflecting their rapid evolution in diagnostic radiology.

Eight multimodal large language models were evaluated on the Japan Diagnostic Radiology Board Examination. OpenAI's o3 and Google DeepMind's Gemini 2.5 Pro achieved high accuracy rates (72% and 70%) and received good legitimacy scores from human raters, demonstrating steady progress.
