Language-Driven Cross-Attention for Visible-Infrared Image Fusion Using CLIP.

Authors

Wang Xue, Wu Jiatong, Zhang Pengfei, Yu Zhongjun

Affiliations

Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China.

Publication

Sensors (Basel). 2025 Aug 15;25(16):5083. doi: 10.3390/s25165083.

Abstract

Language-guided multimodal fusion, which integrates information from both visible and infrared images, has shown strong performance in image fusion tasks. In low-light or complex environments, a single modality often fails to fully capture scene features, whereas fused images enable robots to obtain multidimensional scene understanding for navigation, localization, and environmental perception. This capability is particularly important in applications such as autonomous driving, intelligent surveillance, and search-and-rescue operations, where accurate recognition and efficient decision-making are critical. To enhance the effectiveness of multimodal fusion, we propose a text-guided infrared and visible image fusion network. The framework consists of two key components: an image fusion branch, which employs a cross-domain attention mechanism to merge multimodal features, and a text-guided module, which leverages the CLIP model to extract semantic cues from image descriptions containing visible content. These semantic parameters are then used to guide the feature modulation process during fusion. By integrating visual and linguistic information, our framework is capable of generating high-quality color-fused images that not only enhance visual detail but also enrich semantic understanding. On benchmark datasets, our method achieves strong quantitative performance: SF = 2.1381, Qab/f = 0.6329, MI = 14.2305, SD = 0.8527, VIF = 45.1842 on LLVIP, and SF = 1.3149, Qab/f = 0.5863, MI = 13.9676, SD = 94.7203, VIF = 0.7746 on TNO. These results highlight the robustness and scalability of our model, making it a promising solution for real-world multimodal perception applications.
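
To make the described pipeline concrete, the sketch below pairs a cross-domain attention block (visible features attending to infrared features) with a FiLM-style modulation driven by a CLIP text embedding, mirroring the two components named in the abstract. It is a minimal illustration assuming PyTorch and OpenAI's open-source clip package; the module names, feature dimensions, prompt text, and the specific modulation scheme are assumptions for illustration, not the authors' exact architecture.

```python
# Minimal sketch of (1) cross-domain attention fusion of visible/infrared features
# and (2) language-driven modulation from a CLIP text embedding.
# Dimensions, prompt text, and the FiLM-style scale/shift are illustrative assumptions.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


class CrossDomainAttention(nn.Module):
    """Visible-token queries attend over infrared-token keys/values."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, ir_tokens):
        # vis_tokens, ir_tokens: (B, N, dim) flattened spatial features
        fused, _ = self.attn(query=vis_tokens, key=ir_tokens, value=ir_tokens)
        return self.norm(vis_tokens + fused)  # residual connection + layer norm


class TextGuidedModulation(nn.Module):
    """Map a CLIP text embedding to per-channel scale/shift parameters (FiLM-style)."""
    def __init__(self, text_dim: int, feat_dim: int):
        super().__init__()
        self.to_scale = nn.Linear(text_dim, feat_dim)
        self.to_shift = nn.Linear(text_dim, feat_dim)

    def forward(self, feats, text_emb):
        # feats: (B, N, feat_dim); text_emb: (B, text_dim)
        scale = self.to_scale(text_emb).unsqueeze(1)  # (B, 1, feat_dim)
        shift = self.to_shift(text_emb).unsqueeze(1)
        return feats * (1 + scale) + shift


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, _ = clip.load("ViT-B/32", device=device)

    # Hypothetical description of the visible-image content used as the semantic cue.
    tokens = clip.tokenize(["a pedestrian crossing a dimly lit street at night"]).to(device)
    with torch.no_grad():
        text_emb = clip_model.encode_text(tokens).float()  # (1, 512)

    # Dummy visible/infrared feature maps flattened into token sequences.
    B, N, dim = 1, 32 * 32, 256
    vis_tokens = torch.randn(B, N, dim, device=device)
    ir_tokens = torch.randn(B, N, dim, device=device)

    fuse = CrossDomainAttention(dim).to(device)
    guide = TextGuidedModulation(text_dim=512, feat_dim=dim).to(device)

    fused = fuse(vis_tokens, ir_tokens)  # cross-domain attention fusion
    fused = guide(fused, text_emb)       # language-driven feature modulation
    print(fused.shape)                   # torch.Size([1, 1024, 256])
```

Applying the attention symmetrically (infrared queries attending to visible keys) and decoding the modulated tokens back to an RGB image would round out a full fusion network in the spirit of the framework described above.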

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d9c6/12390620/c5653e5d1eac/sensors-25-05083-g001.jpg
