One Year On: Assessing Progress of Multimodal Large Language Model Performance on RSNA 2024 Case of the Day Questions.

Authors

Hou Benjamin, Mukherjee Pritam, Batheja Vivek, Wang Kenneth C, Summers Ronald M, Lu Zhiyong

Affiliations

Division of Intramural Research, National Library of Medicine, National Institutes of Health, Bethesda, Md.

Imaging Biomarkers and Computer Aided Diagnosis Lab, Clinical Center, National Institutes of Health, 8600 Rockville Pike, Bldg 38A Lister Hill, Bethesda, MD 20894.

Publication information

Radiology. 2025 Aug;316(2):e250617. doi: 10.1148/radiol.250617.

DOI: 10.1148/radiol.250617
PMID: 40793942
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12405704/

Abstract

Background: With the growing use of multimodal large language models (LLMs), numerous vision-enabled models have been developed and made available to the public.

Purpose: To assess and quantify the advancements of multimodal LLMs in interpreting radiologic quiz cases by examining both image and textual content over the course of 1 year, and to compare model performance with that of radiologists.

Materials and Methods: For this retrospective study, 95 questions from Case of the Day at the RSNA 2024 Annual Meeting were collected. Seventy-six questions from the 2023 meeting were included as a baseline for comparison. The test accuracies of prominent multimodal LLMs (including OpenAI's ChatGPT, Google's Gemini, and Meta's open-source Llama 3.2 models) were evaluated and compared with each other and with the accuracies of two senior radiologists. McNemar statistical test was used to assess statistical significance.

Results: The newly released models OpenAI o1 and GPT-4o achieved scores on the questions from 2024 of 59% (56 of 95; 95% CI: 48, 69) and 54% (51 of 95; 95% CI: 43, 64), respectively, whereas Gemini 1.5 Pro (Google) achieved a score of 36% (34 of 95; 95% CI: 26, 46), and Llama 3.2-90B-Vision (Meta) achieved a score of 33% (31 of 95; 95% CI: 23, 43). For the questions from 2023, OpenAI o1 and GPT-4o scored 62% (47 of 76; 95% CI: 50, 73) and 54% (41 of 76; 95% CI: 42, 65), respectively. GPT-4 (from 2023), the only publicly available vision-language model from OpenAI last year, achieved 43% (33 of 76; 95% CI: 32, 55). The accuracy of OpenAI o1 on the 2024 questions (59%) was comparable to two radiologists, who scored 58% (55 of 95; 95% CI: 47, 68; P = .99) and 66% (63 of 95; 95% CI: 56, 76; P = .99).

Conclusion: In 1 year, multimodal LLMs demonstrated substantial advancements, with latest models from OpenAI outperforming those from Google and Meta. Notably, there was no evidence of a statistically significant difference between the accuracy of OpenAI o1 and the accuracies of two expert radiologists.

© RSNA, 2025. See also the editorial by Suh and Suh in this issue.
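The accuracy figures and the paired comparison described in the abstract can be illustrated with a short sketch. The abstract does not say which confidence-interval method the authors used, so the exact (Clopper-Pearson) interval below is an assumption and may differ from the reported bounds by a percentage point. The abstract also does not report the discordant-pair counts needed for the McNemar test, so `mcnemar_exact` is shown only as a generic helper. All function names are illustrative, not taken from the paper.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided CI for k correct answers out of n questions.

    Assumption: the paper's CI method is not stated in the abstract;
    Clopper-Pearson is used here purely for illustration.
    """

    def solve(f, target):
        # Bisection on p in [0, 1]; f is monotonically decreasing in p.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b and c are the discordant counts: questions that only rater A or
    only rater B answered correctly. The abstract does not report them.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


# Correct-answer counts on the 95 RSNA 2024 questions, as given in the abstract.
scores_2024 = {
    "OpenAI o1": 56,
    "GPT-4o": 51,
    "Gemini 1.5 Pro": 34,
    "Llama 3.2-90B-Vision": 31,
}
for model, correct in scores_2024.items():
    low, high = clopper_pearson(correct, 95)
    print(f"{model}: {correct / 95:.0%} (95% CI: {low:.0%}, {high:.0%})")
```

The McNemar test uses only the questions on which the two raters disagree, which is why marginal totals such as 56 of 95 versus 55 of 95 are not enough to reproduce the reported P values.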
