Zhang Xiaofeng, Zeng Fanshuo, Gu Chaochen
Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, Shanghai, China.
Central South University, 932 South Lushan Road, Yuelu District, Changsha, 410083, Hunan, China.
Neural Netw. 2025 Apr;184:107059. doi: 10.1016/j.neunet.2024.107059. Epub 2024 Dec 31.
The field of multimodal large language models (MLLMs) has grown rapidly in recent years, with many large vision-language models (LVLMs) relying on sequential visual representations. In these models, images are broken down into numerous tokens before being fed into a large language model (LLM) alongside text prompts. However, the opaque nature of these models poses significant challenges to their interpretability, particularly for complex reasoning tasks. To address this issue, we used Grad-CAM to investigate the interaction dynamics between images and text during complex reasoning. Our analysis revealed a distinct pattern: information flow tends to converge in the initial layers and then disperse through the deeper layers. This pattern suggests that the early stages of processing focus on the interaction between visual and textual elements, while later stages carry out deeper reasoning. Based on this insight, we developed Simignore, a novel image-token reduction technique. Simignore computes the similarity between image and text embeddings and ignores image tokens that are not semantically relevant to the text, thereby enhancing the model's complex reasoning capabilities. Extensive experiments across different MLLM architectures show that our approach consistently improves performance on complex reasoning tasks. This work not only advances MLLM interpretability but also provides a robust framework for future research in this area. The paper's source code can be accessed from https://github.com/FanshuoZeng/Simignore.
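The core mechanism the abstract describes, scoring image tokens by their embedding similarity to the text prompt and ignoring the low-similarity ones before the LLM sees them, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name ignore_image_tokens, the keep_ratio hyperparameter, and the max-over-text-tokens aggregation are all assumptions made for the sketch.

```python
# Minimal sketch of similarity-based image-token ignoring in the spirit of
# Simignore. Names and the keep_ratio heuristic are illustrative assumptions.
import torch
import torch.nn.functional as F


def ignore_image_tokens(image_embeds: torch.Tensor,
                        text_embeds: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the image tokens most similar to the text prompt.

    image_embeds: (N_img, d) image-token embeddings, assumed already
                  projected into the LLM embedding space.
    text_embeds:  (N_txt, d) embeddings of the text prompt tokens.
    keep_ratio:   fraction of image tokens to retain (assumed hyperparameter).
    """
    # Cosine similarity between every image token and every text token.
    img = F.normalize(image_embeds, dim=-1)   # (N_img, d)
    txt = F.normalize(text_embeds, dim=-1)    # (N_txt, d)
    sim = img @ txt.T                         # (N_img, N_txt)

    # Score each image token by its strongest alignment with any text token
    # (one plausible aggregation; the paper may aggregate differently).
    scores = sim.max(dim=-1).values           # (N_img,)

    # Retain the top-k most text-relevant image tokens, restoring their
    # original order so positional structure is preserved for the LLM.
    k = max(1, int(keep_ratio * image_embeds.size(0)))
    keep = scores.topk(k).indices.sort().values
    return image_embeds[keep]


# Usage: prune the projected image tokens, then concatenate the survivors
# with the text embeddings and feed the shortened sequence to the LLM.
# pruned = ignore_image_tokens(image_embeds, text_embeds, keep_ratio=0.5)
```

The design choice in this sketch, filtering after projection into the LLM embedding space, follows from the abstract's premise that relevance is measured between image and text embeddings; where exactly the filtering is applied in the model pipeline is detailed in the paper and code repository.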