IEEE Trans Image Process. 2021;30:617-627. doi: 10.1109/TIP.2020.3038354. Epub 2020 Dec 1.
Cross-modal retrieval aims to identify relevant data across different modalities. In this work, we focus on cross-modal retrieval between images and text sentences, which we formulate as similarity measurement for each image-text pair. To this end, we propose a Cross-modal Relation Guided Network (CRGN) that embeds images and text into a latent feature space. The CRGN model uses a GRU to extract text features and a ResNet to learn globally guided image features. Based on global feature guiding and sentence generation learning, the relations between image regions can be modeled. The final image embedding is generated by a relation embedding module with an attention mechanism. With the image and text embeddings, we conduct cross-modal retrieval based on cosine similarity. The learned embedding space captures the inherent relevance between image and text well. We evaluate our approach with extensive experiments on two public benchmark datasets, i.e., MS-COCO and Flickr30K. Experimental results demonstrate that our approach achieves better or comparable performance to state-of-the-art methods with notable efficiency.
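The retrieval pipeline described above (a GRU text encoder, ResNet-based image features, attention over image regions guided by a global feature, and cosine-similarity matching) can be illustrated with a minimal PyTorch sketch. The module names, dimensions, and the single-step guided attention below are illustrative assumptions rather than the authors' exact CRGN architecture; precomputed ResNet region features are assumed as input.

```python
# Minimal sketch of a CRGN-style image-text retrieval pipeline (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):
    """Encodes a tokenized sentence into a joint-space embedding with a GRU."""
    def __init__(self, vocab_size, word_dim=300, embed_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, embed_dim, batch_first=True)

    def forward(self, tokens):                      # tokens: (B, L)
        x = self.embed(tokens)                      # (B, L, word_dim)
        _, h = self.gru(x)                          # h: (1, B, embed_dim)
        return F.normalize(h.squeeze(0), dim=-1)    # unit-norm text embedding

class ImageEncoder(nn.Module):
    """Aggregates precomputed ResNet region features into one image embedding,
    using the mean (global) feature to attend over regions; this single guided
    attention step stands in for the paper's relation embedding module."""
    def __init__(self, feat_dim=2048, embed_dim=1024):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)  # project region features to the joint space
        self.attn = nn.Linear(2 * embed_dim, 1)     # score each region against the global feature

    def forward(self, regions):                     # regions: (B, R, feat_dim)
        r = self.proj(regions)                      # (B, R, embed_dim)
        g = r.mean(dim=1, keepdim=True)             # global guiding feature: (B, 1, embed_dim)
        scores = self.attn(torch.cat([r, g.expand_as(r)], dim=-1))  # (B, R, 1)
        weights = torch.softmax(scores, dim=1)      # attention over regions
        v = (weights * r).sum(dim=1)                # attention-weighted image embedding
        return F.normalize(v, dim=-1)

def retrieve(image_emb, text_emb):
    """Cross-modal retrieval: rank candidates by cosine similarity.
    Both inputs are unit-normalized, so a matrix product gives cosine scores."""
    return image_emb @ text_emb.t()                 # (num_images, num_texts)

if __name__ == "__main__":
    img_enc, txt_enc = ImageEncoder(), TextEncoder(vocab_size=10000)
    regions = torch.randn(4, 36, 2048)              # e.g. 36 ResNet region features per image
    tokens = torch.randint(0, 10000, (5, 12))       # 5 sentences of length 12
    sims = retrieve(img_enc(regions), txt_enc(tokens))
    print(sims.shape)                               # torch.Size([4, 5])
```

In this sketch, image-to-text retrieval ranks sentences by row of the similarity matrix and text-to-image retrieval ranks images by column; normalizing both embeddings makes the matrix product equal to cosine similarity.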