Beyond the Hype: A Dispassionate Look at Vision-Language Models in Medical Scenario.

Authors

Nan Yang, Huichi Zhou, Xiaodan Xing, Guang Yang

Publication

IEEE Trans Neural Netw Learn Syst. 2025 Apr 24;PP. doi: 10.1109/TNNLS.2025.3558857.

Abstract

Recent advances in large vision-language models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, garnering significant attention in the AI community. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most assessments concentrate narrowly on evaluating LVLMs via simple visual question answering (VQA) over multimodal data, overlooking the deeper characteristics of these models. In this study, we introduce RadVUQA, a novel radiological visual understanding and question answering benchmark, to comprehensively evaluate existing LVLMs. RadVUQA validates LVLMs across five dimensions: 1) anatomical understanding, assessing the models' ability to visually identify biological structures; 2) multimodal comprehension, the capability to interpret linguistic and visual instructions to produce desired outcomes; 3) quantitative and spatial reasoning, evaluating the models' spatial awareness and proficiency in combining quantitative analysis with visual and linguistic information; 4) physiological knowledge, measuring the models' grasp of the functions and mechanisms of organs and organ systems; and 5) robustness, assessing the models' capabilities on unharmonized and synthetic data. The results indicate that both general-purpose and medical-specific LVLMs exhibit critical deficiencies, with weak multimodal comprehension and quantitative reasoning capabilities. Our findings reveal a large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs. The code is available at https://github.com/Nandayang/RadVUQA.
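
To make the evaluation protocol implied by the abstract concrete, below is a minimal Python sketch of a per-dimension VQA scoring loop. The Sample fields, the model.answer(image_path, question) interface, the dimension names, and the exact-match metric are all illustrative assumptions, not the actual RadVUQA implementation; see the linked repository for the real code.

from collections import defaultdict
from dataclasses import dataclass

# Hypothetical dimension labels mirroring the five axes described in the abstract.
DIMENSIONS = (
    "anatomical_understanding",
    "multimodal_comprehension",
    "quantitative_spatial_reasoning",
    "physiological_knowledge",
    "robustness",
)

@dataclass
class Sample:
    image_path: str   # radiological image, e.g. a CT or MRI slice
    question: str     # linguistic instruction paired with the image
    answer: str       # reference answer
    dimension: str    # one of DIMENSIONS

def evaluate(model, samples):
    """Score a model separately on each dimension so that specific
    weaknesses (e.g. quantitative reasoning) are not averaged away."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        pred = model.answer(s.image_path, s.question)  # assumed interface
        total[s.dimension] += 1
        correct[s.dimension] += int(pred.strip().lower() == s.answer.strip().lower())
    return {d: correct[d] / total[d] for d in DIMENSIONS if total[d]}

Reporting one score per dimension, rather than a single aggregate accuracy, is what allows dimension-specific weaknesses such as quantitative reasoning to surface in the results.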

