IEEE Trans Image Process. 2023;32:3054-3065. doi: 10.1109/TIP.2023.3277791. Epub 2023 May 30.
We address the problem of referring image segmentation, which aims to generate a mask for the object specified by a natural language expression. Many recent works use a Transformer to extract features for the target object by aggregating the attended visual regions. However, the generic attention mechanism in the Transformer uses the language input only to compute attention weights and does not explicitly fuse language features into its output. Its output feature is therefore dominated by vision information, which limits the model's ability to comprehensively understand the multi-modal input and introduces uncertainty for the subsequent mask decoder that extracts the output mask. To address this issue, we propose Multi-Modal Mutual Attention (M3Att) and a Multi-Modal Mutual Decoder (MDec) that better fuse information from the two input modalities. Based on MDec, we further propose Iterative Multi-modal Interaction (IMI) to allow continuous and in-depth interaction between language and vision features. Furthermore, we introduce Language Feature Reconstruction (LFR) to prevent the language information from being lost or distorted in the extracted features. Extensive experiments show that the proposed approach significantly improves over the baseline and consistently outperforms state-of-the-art referring image segmentation methods on the RefCOCO series of datasets.
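The limitation described above can be illustrated with a minimal NumPy sketch. In generic cross-attention, the language expression only steers the attention weights, so the output is a convex combination of vision features alone; a mutual-attention-style variant additionally lets vision tokens absorb language values before aggregation, so language content appears explicitly in the output. This is an illustrative simplification under assumed shapes, not the paper's exact M3Att formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8                                  # feature dimension (assumed)
vis = rng.standard_normal((16, d))     # 16 visual tokens, e.g. flattened patches
lang = rng.standard_normal((5, d))     # 5 language tokens, e.g. expression words

# Generic attention: language queries only compute the weights;
# the output aggregates *vision* values, so language never enters the output.
w = softmax(lang @ vis.T / np.sqrt(d))       # (5, 16) language-steered weights
plain_out = w @ vis                          # purely a mix of vision features

# Mutual-style variant (illustrative): vision tokens first attend to the
# language values, so the final aggregation carries explicit language content.
w_vl = softmax(vis @ lang.T / np.sqrt(d))    # (16, 5) vision-to-language weights
vis_fused = vis + w_vl @ lang                # vision enriched with language values
mutual_out = w @ vis_fused                   # fused multi-modal output

print(plain_out.shape, mutual_out.shape)     # both (5, 8)
```

The point of the contrast is that `plain_out` lies entirely in the span of the vision rows, whereas `mutual_out` mixes in the language rows, mirroring the abstract's argument for explicitly fusing both modalities in the attention output.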