Beyond the Hype: A Dispassionate Look at Vision-Language Models in Medical Scenario.

Authors

Nan Yang, Huichi Zhou, Xiaodan Xing, Guang Yang

Publication

IEEE Trans Neural Netw Learn Syst. 2025 Apr 24;PP. doi: 10.1109/TNNLS.2025.3558857.

Abstract

Recent advances in large vision-language models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, garnering significant attention in the AI community. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most assessments concentrate narrowly on evaluating LVLMs via simple visual question answering (VQA) over multimodal data, overlooking the deeper characteristics of these models. In this study, we introduce RadVUQA, a novel radiological visual understanding and question answering benchmark, to comprehensively evaluate existing LVLMs. RadVUQA validates LVLMs across five dimensions: 1) anatomical understanding, assessing the models' ability to visually identify biological structures; 2) multimodal comprehension, the capability to interpret linguistic and visual instructions to produce desired outcomes; 3) quantitative and spatial reasoning, evaluating the models' spatial awareness and proficiency in combining quantitative analysis with visual and linguistic information; 4) physiological knowledge, measuring the models' grasp of the functions and mechanisms of organs and organ systems; and 5) robustness, assessing the models' capabilities on unharmonized and synthetic data. The results indicate that both general-purpose and medical-specific LVLMs exhibit critical deficiencies, with weak multimodal comprehension and quantitative reasoning capabilities. Our findings reveal a large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs. The code is available at https://github.com/Nandayang/RadVUQA.
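
To make the evaluation protocol implied by the abstract concrete, below is a minimal Python sketch of a per-dimension VQA scoring loop. The Sample fields, the model.answer(image_path, question) interface, the dimension names, and the exact-match metric are all illustrative assumptions, not the actual RadVUQA implementation; see the linked repository for the real code.

from collections import defaultdict
from dataclasses import dataclass

# Hypothetical dimension labels mirroring the five axes described in the abstract.
DIMENSIONS = (
    "anatomical_understanding",
    "multimodal_comprehension",
    "quantitative_spatial_reasoning",
    "physiological_knowledge",
    "robustness",
)

@dataclass
class Sample:
    image_path: str   # radiological image, e.g. a CT or MRI slice
    question: str     # linguistic instruction paired with the image
    answer: str       # reference answer
    dimension: str    # one of DIMENSIONS

def evaluate(model, samples):
    """Score a model separately on each dimension so that specific
    weaknesses (e.g. quantitative reasoning) are not averaged away."""
    correct, total = defaultdict(int), defaultdict(int)
    for s in samples:
        pred = model.answer(s.image_path, s.question)  # assumed interface
        total[s.dimension] += 1
        correct[s.dimension] += int(pred.strip().lower() == s.answer.strip().lower())
    return {d: correct[d] / total[d] for d in DIMENSIONS if total[d]}

Reporting one score per dimension, rather than a single aggregate accuracy, is what allows dimension-specific weaknesses such as quantitative reasoning to surface in the results.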

