

GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination.

Author Information

Department of Radiology, The International University of Health and Welfare Narita Hospital, 852 Hatakeda, Narita, Chiba, Japan.

Department of Radiology, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, Japan.

Publication Information

Jpn J Radiol. 2024 Aug;42(8):918-926. doi: 10.1007/s11604-024-01561-z. Epub 2024 May 11.

Abstract

PURPOSE

To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE).

MATERIALS AND METHODS

The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers by consulting relevant literature as necessary. The following questions were excluded: those lacking associated images, those with no unanimous agreement on answers, and those including images rejected by the OpenAI application programming interface. The inputs for GPT-4TV included both text and images, whereas those for GPT-4 T were entirely text. Both models were deployed on the dataset, and their performance was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists through the assignment of legitimacy scores on a five-point Likert scale. These scores were subsequently used to compare model performance using Wilcoxon's signed-rank test.
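For context, the following is a minimal sketch of how the two input conditions described above could be submitted through the OpenAI chat completions API. It is not the authors' code; the model identifiers and helper functions are assumptions for illustration only.

```python
# Hypothetical sketch of the two input conditions (text-only vs. text-plus-image).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_text_only(question_text: str) -> str:
    """Text-only query, analogous to the GPT-4 Turbo (GPT-4 T) condition."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # assumed model identifier
        messages=[{"role": "user", "content": question_text}],
    )
    return response.choices[0].message.content


def ask_with_image(question_text: str, image_url: str) -> str:
    """Text-plus-image query, analogous to the GPT-4 Turbo with Vision (GPT-4TV) condition."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed identifier for the vision-enabled variant
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question_text},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Questions whose images were rejected by the API, as noted above, would simply be excluded before either function is called.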

RESULTS

The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%), whereas GPT-4 T correctly answered 57 questions (41%). A statistical analysis found no significant performance difference between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores from both radiologists than the GPT-4 T responses.
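As an illustration of the paired analyses named in the Methods (McNemar's exact test on per-question correctness and Wilcoxon's signed-rank test on the Likert legitimacy scores), the following is a minimal sketch using scipy and statsmodels. All arrays are placeholder data, not the study's results, and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Placeholder per-question correctness flags (1 = correct), NOT the study data.
correct_tv = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # GPT-4TV
correct_t = np.array([1, 1, 0, 1, 0, 0, 0, 0])   # GPT-4 T

# Paired 2x2 table: rows = GPT-4TV correct/incorrect, columns = GPT-4 T correct/incorrect.
table = [
    [int(np.sum((correct_tv == 1) & (correct_t == 1))),
     int(np.sum((correct_tv == 1) & (correct_t == 0)))],
    [int(np.sum((correct_tv == 0) & (correct_t == 1))),
     int(np.sum((correct_tv == 0) & (correct_t == 0)))],
]
print(mcnemar(table, exact=True))  # exact McNemar test on the discordant pairs

# Placeholder five-point Likert legitimacy scores from one reader, NOT the study data.
scores_tv = np.array([3, 2, 4, 3, 2, 1, 3, 2])
scores_t = np.array([4, 3, 4, 4, 3, 2, 3, 3])
print(wilcoxon(scores_tv, scores_t))  # paired Wilcoxon signed-rank test
```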

CONCLUSION

No significant enhancement in accuracy was observed when using GPT-4TV with image input compared with using text-only GPT-4 T for JDRBE questions.


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fdf0/11286662/7e6d5b30d652/11604_2024_1561_Fig1_HTML.jpg
