

GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination.

Affiliations

Department of Radiology, The International University of Health and Welfare Narita Hospital, 852 Hatakeda, Narita, Chiba, Japan.

Department of Radiology, the University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-Ku, Tokyo, Japan.

Publication information

Jpn J Radiol. 2024 Aug;42(8):918-926. doi: 10.1007/s11604-024-01561-z. Epub 2024 May 11.

DOI: 10.1007/s11604-024-01561-z
PMID: 38733472
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11286662/
Abstract

PURPOSE

To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE).

MATERIALS AND METHODS

The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers by consulting relevant literature as necessary. The following questions were excluded: those lacking associated images, those with no unanimous agreement on answers, and those including images rejected by the OpenAI application programming interface. The inputs for GPT-4TV included both text and images, whereas those for GPT-4 T were entirely text. Both models were deployed on the dataset, and their performance was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists through the assignment of legitimacy scores on a five-point Likert scale. These scores were subsequently used to compare model performance using Wilcoxon's signed-rank test.
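The paired comparison described above — McNemar's exact test on the two models' per-question correctness — can be sketched in plain Python. The discordant counts below are hypothetical, since the abstract reports only each model's total correct answers, not the per-question agreement table:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test p-value.

    b: questions only model A answered correctly
    c: questions only model B answered correctly
    Under H0, the discordant pairs follow Binomial(b + c, 0.5),
    so the exact p-value is a two-sided binomial tail probability.
    """
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0  # no discordant pairs: the models never disagree
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # double the smaller tail, cap at 1

# Hypothetical discordant counts consistent with the 5-question gap
# (62 - 57 = 5) reported in the abstract; actual counts are not given.
p = mcnemar_exact(20, 15)
print(f"P = {p:.2f}")
```

Only the discordant pairs (questions where exactly one model is correct) enter the test; questions both models got right or both got wrong carry no information about which model is better.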

RESULTS

The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%), whereas GPT-4 T correctly answered 57 questions (41%). A statistical analysis found no significant performance difference between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores from both radiologists than the GPT-4 T responses.
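As a quick check, the reported percentages follow from the raw counts (illustrative arithmetic only):

```python
total = 139  # questions in the dataset
for model, correct in [("GPT-4TV", 62), ("GPT-4 T", 57)]:
    # 62/139 and 57/139 round to the reported 45% and 41%
    print(f"{model}: {correct}/{total} = {correct / total:.1%}")
```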

CONCLUSION

No significant enhancement in accuracy was observed when using GPT-4TV with image input compared with that of using text-only GPT-4 T for JDRBE questions.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fdf0/11286662/83a1d0c5ed41/11604_2024_1561_Fig3a_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fdf0/11286662/7e6d5b30d652/11604_2024_1561_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fdf0/11286662/2d385e88097f/11604_2024_1561_Fig2_HTML.jpg

Similar articles

1. GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination.
Jpn J Radiol. 2024 Aug;42(8):918-926. doi: 10.1007/s11604-024-01561-z. Epub 2024 May 11.
2. Performance of Multimodal Large Language Models in Japanese Diagnostic Radiology Board Examinations (2021-2023).
Acad Radiol. 2025 May;32(5):2394-2401. doi: 10.1016/j.acra.2024.10.035. Epub 2024 Nov 8.
3. Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study.
J Med Internet Res. 2025 Feb 7;27:e65146. doi: 10.2196/65146.
4. Diagnostic accuracy of vision-language models on Japanese diagnostic radiology, nuclear medicine, and interventional radiology specialty board examinations.
Jpn J Radiol. 2024 Dec;42(12):1392-1398. doi: 10.1007/s11604-024-01633-0. Epub 2024 Jul 20.
5. GPT-4 turbo with vision fails to outperform text-only GPT-4 turbo in the Japan diagnostic radiology board examination: correspondence.
Jpn J Radiol. 2024 Oct;42(10):1213. doi: 10.1007/s11604-024-01600-9. Epub 2024 May 21.
6. Large Language Models with Vision on Diagnostic Radiology Board Exam Style Questions.
Acad Radiol. 2025 May;32(5):3096-3102. doi: 10.1016/j.acra.2024.11.028. Epub 2024 Dec 4.
7. Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam.
J Nucl Cardiol. 2025 Mar;45:102089. doi: 10.1016/j.nuclcard.2024.102089. Epub 2024 Nov 29.
8. Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan.
Jpn J Radiol. 2025 Feb;43(2):319-329. doi: 10.1007/s11604-024-01673-6. Epub 2024 Oct 28.
9. Evaluation of GPT Large Language Model Performance on RSNA 2023 Case of the Day Questions.
Radiology. 2024 Oct;313(1):e240609. doi: 10.1148/radiol.240609.
10. Performance of GPT-4 with Vision on Text- and Image-based ACR Diagnostic Radiology In-Training Examination Questions.
Radiology. 2024 Sep;312(3):e240153. doi: 10.1148/radiol.240153.

Cited by

1. Assessing accuracy and legitimacy of multimodal large language models on Japan Diagnostic Radiology Board Examination.
Jpn J Radiol. 2025 Sep 12. doi: 10.1007/s11604-025-01861-y.
2. Intra-axial primary brain tumor differentiation: comparing large language models on structured MRI reports vs. radiologists on images.
Eur Radiol. 2025 Aug 22. doi: 10.1007/s00330-025-11924-3.
3. Role of Artificial Intelligence in Surgical Training by Assessing GPT-4 and GPT-4o on the Japan Surgical Board Examination With Text-Only and Image-Accompanied Questions: Performance Evaluation Study.
JMIR Med Educ. 2025 Jul 30;11:e69313. doi: 10.2196/69313.
4. Diagnostic Performance of a Large Language Model for Determining the Cause of Death: A Comparative Analysis of Clinical History, Postmortem Computed Tomography Findings, and Their Integration.
Cureus. 2025 May 8;17(5):e83721. doi: 10.7759/cureus.83721. eCollection 2025 May.
5. Evaluating ChatGPT-4's Performance in Identifying Radiological Anatomy in FRCR Part 1 Examination Questions.
Indian J Radiol Imaging. 2024 Nov 4;35(2):287-294. doi: 10.1055/s-0044-1792040. eCollection 2025 Apr.
6. It is Not Time to Kick Out Radiologists.
Asian Bioeth Rev. 2024 Dec 3;17(1):9-15. doi: 10.1007/s41649-024-00325-1. eCollection 2025 Jan.
7. The critical need for an open medical imaging database in Japan: implications for global health and AI development.
Jpn J Radiol. 2025 Apr;43(4):537-541. doi: 10.1007/s11604-024-01716-y. Epub 2024 Dec 13.
8. Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam.
J Nucl Cardiol. 2025 Mar;45:102089. doi: 10.1016/j.nuclcard.2024.102089. Epub 2024 Nov 29.
9. GPT-4 Vision: Multi-Modal Evolution of ChatGPT and Potential Role in Radiology.
Cureus. 2024 Aug 31;16(8):e68298. doi: 10.7759/cureus.68298. eCollection 2024 Aug.
10. Generative AI and large language models in nuclear medicine: current status and future prospects.
Ann Nucl Med. 2024 Nov;38(11):853-864. doi: 10.1007/s12149-024-01981-x. Epub 2024 Sep 25.

References

1. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study.
JMIR Med Educ. 2024 Mar 12;10:e54393. doi: 10.2196/54393.
2. Performance of Generative Pretrained Transformer on the National Medical Licensing Examination in Japan.
PLOS Digit Health. 2024 Jan 23;3(1):e0000433. doi: 10.1371/journal.pdig.0000433. eCollection 2024 Jan.
3. Five dominant dimensions of brain aging are identified via deep learning: associations with clinical, lifestyle, and genetic measures.
medRxiv. 2023 Dec 30:2023.12.29.23300642. doi: 10.1101/2023.12.29.23300642.
4. Adapting Nanopore Sequencing Basecalling Models for Modification Detection via Incremental Learning and Anomaly Detection.
bioRxiv. 2023 Dec 20:2023.12.19.572449. doi: 10.1101/2023.12.19.572449.
5. Normative Modeling of Brain Morphometry Across the Lifespan Using CentileBrain: Algorithm Benchmarking and Model Optimization.
bioRxiv. 2023 Dec 2:2023.01.30.523509. doi: 10.1101/2023.01.30.523509.
6. How does ChatGPT-4 preform on non-English national medical licensing examination? An evaluation in Chinese language.
PLOS Digit Health. 2023 Dec 1;2(12):e0000397. doi: 10.1371/journal.pdig.0000397. eCollection 2023 Dec.
7. Generation of salivary glands derived from pluripotent stem cells via conditional blastocyst complementation.
bioRxiv. 2023 Nov 15:2023.11.13.566845. doi: 10.1101/2023.11.13.566845.
8. Letter to the editor response to "ChatGPT, GPT-4, and bard and official board examination: comment".
Jpn J Radiol. 2024 Feb;42(2):214-215. doi: 10.1007/s11604-023-01515-x. Epub 2023 Nov 23.
9. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination.
Sci Rep. 2023 Nov 22;13(1):20512. doi: 10.1038/s41598-023-46995-z.
10. Accuracy of ChatGPT on Medical Questions in the National Medical Licensing Examination in Japan: Evaluation Study.
JMIR Form Res. 2023 Oct 13;7:e48023. doi: 10.2196/48023.