IEEE Trans Image Process. 2022;31:4266-4277. doi: 10.1109/TIP.2022.3181516. Epub 2022 Jun 29.
Visual grounding is the task of localizing an object described by a sentence in an image. Conventional visual grounding methods extract visual and linguistic features in isolation and then perform cross-modal interaction in a post-fusion manner. We argue that this post-fusion mechanism does not fully exploit the information in the two modalities. Instead, it is more desirable to perform cross-modal interaction during the extraction of the visual and linguistic features. In this paper, we propose a language-customized visual feature learning mechanism in which linguistic information guides the extraction of visual features from the very beginning. We instantiate the mechanism as a one-stage framework named Progressive Language-customized Visual feature learning (PLV). The proposed PLV consists of a Progressive Language-customized Visual Encoder (PLVE) and a grounding module. We customize the visual features with linguistic guidance at each stage of the PLVE through Channel-wise Language-guided Interaction Modules (CLIM). PLV outperforms conventional state-of-the-art methods by large margins across five visual grounding datasets without pre-training on object detection datasets, while running at real-time speed. The source code is available in the supplementary material.
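The abstract does not specify how CLIM injects linguistic guidance into each encoder stage. A minimal sketch of one plausible reading, channel-wise gating of a visual feature map by a sentence embedding (FiLM-style modulation), is shown below; the function name, the projection matrix `W`, and all shapes are hypothetical illustration, not the paper's actual module.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_wise_language_gate(visual, lang, W):
    """Hypothetical channel-wise language-guided interaction.

    visual: (C, H, W_) visual feature map from one encoder stage
    lang:   (D,) sentence embedding
    W:      (C, D) learned projection mapping language to one gate per channel
    """
    gates = sigmoid(W @ lang)              # (C,) one scalar gate per channel
    return visual * gates[:, None, None]   # broadcast gate over spatial dims

# Toy shapes for illustration only.
C, H, W_, D = 8, 4, 4, 16
visual = rng.standard_normal((C, H, W_))
lang = rng.standard_normal(D)
W = rng.standard_normal((C, D))
out = channel_wise_language_gate(visual, lang, W)
assert out.shape == visual.shape
```

Applying such a gate at every stage of the backbone, rather than once after feature extraction, is what distinguishes the proposed early-interaction scheme from post-fusion.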