Yang Zengyi, Li Yunping, Tang Xin, Xie MingHong
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan, China.
Kunming Cigarette Factory, Hongyunhonghe Tobacco Group Company Limited, Kunming, Yunnan, China.
Front Neurorobot. 2024 Dec 23;18:1521603. doi: 10.3389/fnbot.2024.1521603. eCollection 2024.
Existing image fusion methods focus primarily on complex network architectures while neglecting the limitations of simple fusion strategies in complex scenes. To address this issue, this study proposes a new method for infrared and visible image fusion based on a multimodal large language model. The proposed method accounts for both the strong demand for semantic information when enhancing image quality and the need for effective fusion strategies in complex scenes. We supplement the features in the fusion network with information from the multimodal large language model and construct a new fusion strategy. To this end, we design a CLIP-driven Information Injection (CII) approach and a CLIP-guided Feature Fusion (CFF) strategy. CII uses CLIP to extract robust image features rich in semantic information, which supplement the infrared and visible features and thereby enhance their ability to represent the scene. CFF then uses these CLIP-extracted features to select and fuse the infrared and visible features after the semantic information has been injected, addressing the challenges of image fusion in complex scenes. Compared with existing methods, the main advantage of the proposed method is that it leverages the strong semantic understanding of the multimodal large language model to enrich the infrared and visible features, avoiding the need for elaborate network architecture designs. Experimental results on multiple public datasets validate the effectiveness and superiority of the proposed method.
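To make the described pipeline concrete, below is a minimal PyTorch sketch of how CLIP image features might be injected into modality features (the CII idea) and then used to guide their fusion (the CFF idea). The module names, feature dimensions, additive injection, and sigmoid-gated fusion here are all illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of the CII/CFF idea from the abstract.
# All shapes, projections, and fusion operations are assumptions
# for illustration; the paper's real design may differ.
import torch
import torch.nn as nn

class CLIPInformationInjection(nn.Module):
    """CII sketch: inject a CLIP image embedding (semantic information)
    into an infrared or visible feature map via a learned projection."""
    def __init__(self, clip_dim=512, feat_dim=64):
        super().__init__()
        self.proj = nn.Linear(clip_dim, feat_dim)  # assumed projection

    def forward(self, feat, clip_feat):
        # feat: (B, C, H, W) modality features; clip_feat: (B, clip_dim)
        sem = self.proj(clip_feat)[:, :, None, None]  # (B, C, 1, 1)
        return feat + sem  # additive injection (assumed)

class CLIPGuidedFeatureFusion(nn.Module):
    """CFF sketch: use the CLIP embedding to weight infrared vs. visible
    features per channel before merging (soft selection, assumed)."""
    def __init__(self, clip_dim=512, feat_dim=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(clip_dim, feat_dim), nn.Sigmoid())

    def forward(self, ir_feat, vis_feat, clip_feat):
        w = self.gate(clip_feat)[:, :, None, None]  # per-channel weight in (0, 1)
        return w * ir_feat + (1.0 - w) * vis_feat   # CLIP-guided weighted fusion

# Usage with random tensors standing in for encoder outputs and a
# CLIP image-encoder embedding:
B, C, H, W, D = 2, 64, 32, 32, 512
ir, vis = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
clip_feat = torch.randn(B, D)
cii = CLIPInformationInjection(D, C)
cff = CLIPGuidedFeatureFusion(D, C)
fused = cff(cii(ir, clip_feat), cii(vis, clip_feat), clip_feat)
print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

In this sketch the gate plays the role of the selection mechanism driven by CLIP's semantic features; the paper's CFF may use a different selection rule.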