Bidirectional Relationship Inferring Network for Referring Image Localization and Segmentation
Guang Feng, Zhiwei Hu, Lihe Zhang, Jiayu Sun, Huchuan Lu
IEEE Trans Neural Netw Learn Syst. 2023 May;34(5):2246-2258. doi: 10.1109/TNNLS.2021.3106153. Epub 2023 May 2.
Recently, referring image localization and segmentation have attracted widespread interest. However, existing methods lack an explicit model of the interdependence between language and vision. To this end, we present a bidirectional relationship inferring network (BRINet) to address these challenging tasks effectively. Specifically, we first employ a vision-guided linguistic attention module to perceive the keywords corresponding to each image region. A language-guided visual attention module then uses the learned adaptive language features to guide the update of the visual features. Together, they form a bidirectional cross-modal attention module (BCAM) that achieves mutual guidance between language and vision and helps the network align cross-modal features more accurately. Building on the vanilla language-guided visual attention, we further design an asymmetric language-guided visual attention, which significantly reduces the computational cost by modeling the relationship between each pixel and each pooled subregion rather than between all pixel pairs. In addition, a segmentation-guided bottom-up augmentation module (SBAM) selectively combines multilevel information flow for object localization. Experiments show that our method outperforms other state-of-the-art methods on three referring image localization datasets and four referring image segmentation datasets.
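To make the asymmetric attention idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: each pixel acts as a query while the keys and values come from pyramid-pooled subregions, so the attention matrix shrinks from N×N to N×K (K = total pooled bins). The class name `AsymmetricLGVA`, the pooling pyramid sizes, and the elementwise gating used to inject the sentence feature are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AsymmetricLGVA(nn.Module):
    """Sketch of asymmetric language-guided visual attention:
    pixels (queries) attend to pooled subregions (keys/values),
    reducing cost from O(N^2) to O(N*K)."""

    def __init__(self, dim, pool_sizes=(1, 3, 6, 8)):
        super().__init__()
        self.query = nn.Conv2d(dim, dim, 1)
        self.key = nn.Conv2d(dim, dim, 1)
        self.value = nn.Conv2d(dim, dim, 1)
        self.pool_sizes = pool_sizes  # assumed pyramid; K = sum(s*s) = 110
        self.scale = dim ** -0.5

    def forward(self, vis, lang):
        # vis: (B, C, H, W) visual features; lang: (B, C) adaptive sentence feature
        B, C, H, W = vis.shape
        # Assumed fusion: gate the visual map with the language feature
        fused = vis * lang.view(B, C, 1, 1)                 # (B, C, H, W)
        q = self.query(vis).flatten(2).transpose(1, 2)      # (B, N, C), N = H*W
        # Pyramid-pool the fused map into K subregions in total
        kv = torch.cat(
            [F.adaptive_avg_pool2d(fused, s).flatten(2)     # (B, C, s*s)
             for s in self.pool_sizes],
            dim=2,
        )                                                   # (B, C, K)
        k = self.key(kv.unsqueeze(-1)).squeeze(-1)          # (B, C, K)
        v = self.value(kv.unsqueeze(-1)).squeeze(-1)        # (B, C, K)
        attn = torch.softmax(q @ k * self.scale, dim=-1)    # (B, N, K)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(B, C, H, W)
        return vis + out  # residual update of the visual features
```

With the pooling pyramid above (1 + 9 + 36 + 64 = 110 bins), a 64×64 feature map yields a 4096×110 attention matrix instead of 4096×4096, which is the source of the claimed computational savings.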