Zhang Feifei, Xu Mingliang, Xu Changsheng
IEEE Trans Image Process. 2022;31:1000-1011. doi: 10.1109/TIP.2021.3138302. Epub 2022 Jan 10.
Composed Query Based Image Retrieval (CQBIR) aims at retrieving images relevant to a composed query containing a reference image and a requested modification expressed via a textual sentence. Compared with conventional image retrieval, which takes one modality as the query to retrieve relevant data of another modality, CQBIR poses a great challenge: bridging the semantic gap between the reference image and the modification text in the composed query. To address this challenge, previous methods either resort to feature composition, which cannot model interactions within the query, or explore inter-modal attention while ignoring spatial structure and the visual-semantic relationship. In this paper, we propose a geometry-sensitive cross-modal reasoning network for CQBIR that jointly models the geometric information of the image and the visual-semantic relationship between the reference image and the modification text in the query. Specifically, it contains two key components: a geometry-sensitive inter-modal attention module (GS-IMA) and a text-guided visual reasoning module (TG-VR). The GS-IMA introduces spatial structure into inter-modal attention in both implicit and explicit manners. The TG-VR models the semantics expressed in the text but not included in the reference image, and uses them to guide further visual reasoning. As a result, our method can learn effective features for composed queries whose image and text components are not literally aligned. Comprehensive experimental results on three standard benchmarks demonstrate that the proposed model performs favorably against state-of-the-art methods.
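To make the GS-IMA idea concrete, below is a minimal PyTorch sketch of geometry-sensitive inter-modal attention in which modification-text tokens attend to reference-image regions. Spatial structure enters implicitly, by embedding region bounding boxes into the region features, and explicitly, by adding a learned per-head bias derived from box geometry to the attention logits. All module names, tensor shapes, and the exact form of the geometric bias are assumptions for illustration; the paper's actual GS-IMA formulation is not given in the abstract and may differ.

```python
# Hypothetical sketch of geometry-sensitive inter-modal attention (not the
# authors' implementation). Assumes region features with normalized boxes.
import torch
import torch.nn as nn


class GeometrySensitiveAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)   # queries from text tokens
        self.k_proj = nn.Linear(dim, dim)   # keys from image regions
        self.v_proj = nn.Linear(dim, dim)   # values from image regions
        # Implicit geometry: fold box coordinates into the region features.
        self.box_embed = nn.Linear(4, dim)
        # Explicit geometry: map box geometry to a per-head attention bias.
        self.geo_bias = nn.Linear(4, num_heads)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, text_feats, region_feats, boxes):
        # text_feats:   (B, T, dim)  modification-text token features
        # region_feats: (B, R, dim)  reference-image region features
        # boxes:        (B, R, 4)    normalized (cx, cy, w, h) per region
        B, T, _ = text_feats.shape
        R = region_feats.size(1)

        # Implicitly inject spatial structure into keys and values.
        region_feats = region_feats + self.box_embed(boxes)

        q = self.q_proj(text_feats).view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(region_feats).view(B, R, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(region_feats).view(B, R, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention logits: (B, H, T, R).
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5

        # Explicit geometric bias, broadcast over all text queries:
        # a simple stand-in for the paper's explicit spatial term.
        bias = self.geo_bias(boxes)                           # (B, R, H)
        logits = logits + bias.permute(0, 2, 1).unsqueeze(2)  # (B, H, 1, R)

        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)
```

In this sketch the bias depends only on each region's own box; a richer variant could condition on pairwise region offsets, which is one common way to make attention spatially aware.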