Simignore: Exploring and enhancing multimodal large model complex reasoning via similarity computation.

Author Information

Zhang Xiaofeng, Zeng Fanshuo, Gu Chaochen

Affiliations

Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, Shanghai, China.

Central South University, 932 South Lushan Road, Yuelu District, Changsha, 410083, Hunan, China.

Publication Information

Neural Netw. 2025 Apr;184:107059. doi: 10.1016/j.neunet.2024.107059. Epub 2024 Dec 31.

DOI: 10.1016/j.neunet.2024.107059
PMID: 39787679
Abstract

Recently, the field of multimodal large language models (MLLMs) has grown rapidly, with many Large Vision-Language Models (LVLMs) relying on sequential visual representations. In these models, images are broken down into numerous tokens before being fed into the Large Language Model (LLM) alongside text prompts. However, the opaque nature of these models poses significant challenges to their interpretability, particularly when dealing with complex reasoning tasks. To address this issue, we utilized Grad-CAM to investigate the interaction dynamics between images and text within complex reasoning processes. Our information flow analysis revealed a distinct pattern: it tends to converge in the initial layers and then disperse as it progresses through deeper layers. This pattern suggests that the early stages of processing focus on the interaction between visual and textual elements, while later stages engage in deeper reasoning. We developed Simignore, a novel image token reduction technique based on this insight. Simignore enhances the model's complex reasoning capabilities by calculating the similarity between image and text embeddings, thereby ignoring tokens that are not semantically relevant. Extensive experiments across different MLLM architectures have shown that our approach consistently improves performance in complex reasoning tasks. This work not only contributes to the advancement of MLLM interpretability but also provides a robust framework for future research in this area. The paper's source code can be accessed from https://github.com/FanshuoZeng/Simignore.
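The token-reduction idea described in the abstract (score each image token by its similarity to the text embeddings, then ignore the least relevant tokens) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cosine-similarity measure, mean-pooling of text tokens, and the `keep_ratio` parameter are all assumptions; see the paper's repository for the real method.

```python
import numpy as np

def simignore_filter(image_tokens, text_tokens, keep_ratio=0.5):
    """Keep only the image tokens most similar to the text embedding.

    image_tokens: (N, d) array of image token embeddings
    text_tokens:  (M, d) array of text token embeddings
    keep_ratio:   fraction of image tokens to retain (hypothetical knob)
    """
    # Mean-pool the text tokens into a single query vector.
    text_query = text_tokens.mean(axis=0)

    # Cosine similarity between every image token and the text query.
    img_norm = image_tokens / np.linalg.norm(image_tokens, axis=1, keepdims=True)
    txt_norm = text_query / np.linalg.norm(text_query)
    sims = img_norm @ txt_norm  # shape (N,)

    # Retain the top-k most text-relevant tokens, preserving original order.
    k = max(1, int(len(image_tokens) * keep_ratio))
    keep_idx = np.sort(np.argsort(sims)[-k:])
    return image_tokens[keep_idx], keep_idx

# Example: 16 image tokens and 4 text tokens in an 8-dim embedding space.
rng = np.random.default_rng(0)
img = rng.normal(size=(16, 8))
txt = rng.normal(size=(4, 8))
kept, idx = simignore_filter(img, txt, keep_ratio=0.25)
print(kept.shape)  # (4, 8): only a quarter of the image tokens survive
```

The surviving tokens would then be passed to the LLM in place of the full image-token sequence, shrinking the context the model must reason over.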


Similar Articles

1. Simignore: Exploring and enhancing multimodal large model complex reasoning via similarity computation.
   Neural Netw. 2025 Apr;184:107059. doi: 10.1016/j.neunet.2024.107059. Epub 2024 Dec 31.
2. A survey on multimodal large language models.
   Natl Sci Rev. 2024 Nov 12;11(12):nwae403. doi: 10.1093/nsr/nwae403. eCollection 2024 Dec.
3. Structured Multimodal Attentions for TextVQA.
   IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9603-9614. doi: 10.1109/TPAMI.2021.3132034. Epub 2022 Nov 7.
4. Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval.
   Neural Netw. 2025 Apr;184:107028. doi: 10.1016/j.neunet.2024.107028. Epub 2024 Dec 16.
5. MMAgentRec, a personalized multi-modal recommendation agent with large language model.
   Sci Rep. 2025 Apr 8;15(1):12062. doi: 10.1038/s41598-025-96458-w.
6. Beyond the Hype: A Dispassionate Look at Vision-Language Models in Medical Scenario.
   IEEE Trans Neural Netw Learn Syst. 2025 Apr 24;PP. doi: 10.1109/TNNLS.2025.3558857.
7. Predicting Semantic Similarity Between Clinical Sentence Pairs Using Transformer Models: Evaluation and Representational Analysis.
   JMIR Med Inform. 2021 May 26;9(5):e23099. doi: 10.2196/23099.
8. DICCR: Double-gated intervention and confounder causal reasoning for vision-language navigation.
   Neural Netw. 2025 Apr;184:107078. doi: 10.1016/j.neunet.2024.107078. Epub 2024 Dec 30.
9. LCM-Captioner: A lightweight text-based image captioning method with collaborative mechanism between vision and text.
   Neural Netw. 2023 May;162:318-329. doi: 10.1016/j.neunet.2023.03.010. Epub 2023 Mar 11.
10. Interpretable medical image Visual Question Answering via multi-modal relationship graph learning.
    Med Image Anal. 2024 Oct;97:103279. doi: 10.1016/j.media.2024.103279. Epub 2024 Jul 20.