Zhang Xiaofeng, Zeng Fanshuo, Gu Chaochen
Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, Shanghai, China.
Central South University, 932 South Lushan Road, Yuelu District, Changsha, 410083, Hunan, China.
Neural Netw. 2025 Apr;184:107059. doi: 10.1016/j.neunet.2024.107059. Epub 2024 Dec 31.
The field of multimodal large language models (MLLMs) has grown rapidly in recent years, with many large vision-language models (LVLMs) relying on sequential visual representations. In these models, images are broken down into numerous tokens before being fed into a large language model (LLM) alongside text prompts. However, the opaque nature of these models poses significant challenges to their interpretability, particularly for complex reasoning tasks. To address this issue, we used Grad-CAM to investigate the interaction dynamics between images and text during complex reasoning. Our analysis revealed a distinct pattern: information flow tends to converge in the initial layers and then disperse through the deeper layers. This pattern suggests that the early stages of processing focus on the interaction between visual and textual elements, while later stages carry out deeper reasoning. Based on this insight, we developed Simignore, a novel image-token reduction technique. Simignore computes the similarity between image and text embeddings and ignores image tokens that are not semantically relevant to the text, thereby enhancing the model's complex reasoning capabilities. Extensive experiments across different MLLM architectures show that our approach consistently improves performance on complex reasoning tasks. This work not only advances MLLM interpretability but also provides a robust framework for future research in this area. The paper's source code can be accessed from https://github.com/FanshuoZeng/Simignore.
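The core mechanism the abstract describes, scoring image tokens by their embedding similarity to the text prompt and ignoring the low-similarity ones before the LLM sees them, can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name ignore_image_tokens, the keep_ratio hyperparameter, and the max-over-text-tokens aggregation are all assumptions made for the sketch.

```python
# Minimal sketch of similarity-based image-token ignoring in the spirit of
# Simignore. Names and the keep_ratio heuristic are illustrative assumptions.
import torch
import torch.nn.functional as F


def ignore_image_tokens(image_embeds: torch.Tensor,
                        text_embeds: torch.Tensor,
                        keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the image tokens most similar to the text prompt.

    image_embeds: (N_img, d) image-token embeddings, assumed already
                  projected into the LLM embedding space.
    text_embeds:  (N_txt, d) embeddings of the text prompt tokens.
    keep_ratio:   fraction of image tokens to retain (assumed hyperparameter).
    """
    # Cosine similarity between every image token and every text token.
    img = F.normalize(image_embeds, dim=-1)   # (N_img, d)
    txt = F.normalize(text_embeds, dim=-1)    # (N_txt, d)
    sim = img @ txt.T                         # (N_img, N_txt)

    # Score each image token by its strongest alignment with any text token
    # (one plausible aggregation; the paper may aggregate differently).
    scores = sim.max(dim=-1).values           # (N_img,)

    # Retain the top-k most text-relevant image tokens, restoring their
    # original order so positional structure is preserved for the LLM.
    k = max(1, int(keep_ratio * image_embeds.size(0)))
    keep = scores.topk(k).indices.sort().values
    return image_embeds[keep]


# Usage: prune the projected image tokens, then concatenate the survivors
# with the text embeddings and feed the shortened sequence to the LLM.
# pruned = ignore_image_tokens(image_embeds, text_embeds, keep_ratio=0.5)
```

The design choice in this sketch, filtering after projection into the LLM embedding space, follows from the abstract's premise that relevance is measured between image and text embeddings; where exactly the filtering is applied in the model pipeline is detailed in the paper and code repository.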