Department of Radiology, The International University of Health and Welfare Narita Hospital, 852 Hatakeda, Narita, Chiba, Japan.
Department of Radiology, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan.
Jpn J Radiol. 2024 Aug;42(8):918-926. doi: 10.1007/s11604-024-01561-z. Epub 2024 May 11.
To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE).
The dataset comprised questions from JDRBE 2021 and 2023. Six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers, consulting relevant literature as necessary. Questions were excluded if they lacked associated images, if no unanimous agreement on the answer was reached, or if their images were rejected by the OpenAI application programming interface. The inputs for GPT-4TV comprised both text and images, whereas those for GPT-4T were text only. Both models were run on the dataset, and their accuracy was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists, who assigned legitimacy scores on a five-point Likert scale; these scores were compared between the models using Wilcoxon's signed-rank test.
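To illustrate the kind of multimodal query described above, the following is a minimal sketch of submitting one exam question together with its figure through the OpenAI chat-completions API, assuming the openai Python client (v1.x). The model name, prompt wording, and request parameters shown here are illustrative assumptions, not the study's actual configuration.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Base64-encode a local image file for inline submission."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def ask_with_image(question_text: str, image_path: str) -> str:
    """Send one question plus its figure to a vision-capable GPT-4 model."""
    image_b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative choice; the study's exact model snapshot is not stated here
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question_text},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
        max_tokens=500,
    )
    return response.choices[0].message.content
```

A text-only run (the GPT-4T arm) would use the same call with the image content part omitted.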
The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%), whereas GPT-4T correctly answered 57 (41%). McNemar's exact test found no significant difference in accuracy between the two models (P = 0.44). Both radiologists assigned significantly lower legitimacy scores to the GPT-4TV responses than to the GPT-4T responses.
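For readers unfamiliar with the paired tests used here, the sketch below shows how such comparisons are typically computed in Python with statsmodels and scipy; these libraries are assumptions on my part, not named in the abstract, and the arrays are placeholders for illustration rather than the study data.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Placeholder per-question outcomes (1 = correct, 0 = incorrect); not the study data.
correct_4tv = np.array([1, 0, 1, 1, 0, 1, 0, 0])
correct_4t = np.array([1, 1, 0, 1, 0, 0, 0, 1])

# 2x2 table of paired agreement/disagreement between the two models.
both = np.sum((correct_4tv == 1) & (correct_4t == 1))
only_4tv = np.sum((correct_4tv == 1) & (correct_4t == 0))
only_4t = np.sum((correct_4tv == 0) & (correct_4t == 1))
neither = np.sum((correct_4tv == 0) & (correct_4t == 0))
table = [[both, only_4tv], [only_4t, neither]]

# McNemar's exact test on the discordant pairs (accuracy comparison).
print(mcnemar(table, exact=True))

# Placeholder per-response legitimacy scores on a five-point Likert scale.
scores_4tv = np.array([3, 2, 4, 1, 3, 2, 5, 2])
scores_4t = np.array([4, 3, 4, 2, 3, 4, 5, 3])

# Wilcoxon's signed-rank test on the paired legitimacy scores.
print(wilcoxon(scores_4tv, scores_4t))
```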
No significant improvement in accuracy was observed when GPT-4TV was given image input compared with the text-only GPT-4T on JDRBE questions.