IEEE Trans Image Process. 2023;32:3622-3633. doi: 10.1109/TIP.2023.3286710. Epub 2023 Jul 3.
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and it serves as the basis for various vision and language tasks. Most previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations for each modality, although important for image-text retrieval, are almost neglected. As a result, such previous works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans simultaneously attend to the entire sample and to regional elements in order to understand the semantic content. To this end, a Token-Guided Dual Transformer (TGDT) architecture, which consists of two homogeneous branches for the image and text modalities, respectively, is proposed for image-text retrieval. The TGDT incorporates both coarse- and fine-grained retrieval into a unified framework and leverages the advantages of both retrieval approaches. A novel training objective, called Consistent Multimodal Contrastive (CMC) loss, is proposed accordingly to ensure intra- and inter-modal semantic consistency between images and texts in the common embedding space. Equipped with a two-stage inference method based on mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performance with extremely low inference time compared with representative recent approaches. Code is publicly available at github.com/LCFractal/TGDT.
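To make the two-stage inference described above concrete, the following is a minimal PyTorch-style sketch (not the authors' released code; see github.com/LCFractal/TGDT for that): a cheap global (coarse) cosine similarity first prunes the candidate set, and a token-level (local) similarity then re-ranks the survivors with a mixed global-plus-local score. The function names, tensor shapes, the max-then-mean token matching rule, the additive score mixing, and the top-k value are all illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of coarse-to-fine (two-stage) image-text retrieval.
# Assumes precomputed global embeddings (one vector per image/text) and
# token-level embeddings (one vector per region/word) from the two branches.
import torch
import torch.nn.functional as F


def global_similarity(img_glb: torch.Tensor, txt_glb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between L2-normalized global (coarse) embeddings.
    img_glb: (N_img, D), txt_glb: (N_txt, D) -> (N_img, N_txt)."""
    return F.normalize(img_glb, dim=-1) @ F.normalize(txt_glb, dim=-1).T


def local_similarity(img_tok: torch.Tensor, txt_tok: torch.Tensor) -> torch.Tensor:
    """Token-level (fine-grained) similarity for one image-text pair: each text
    token is matched to its best image token, then scores are averaged.
    img_tok: (R, D) region tokens, txt_tok: (W, D) word tokens -> scalar."""
    sim = F.normalize(txt_tok, dim=-1) @ F.normalize(img_tok, dim=-1).T  # (W, R)
    return sim.max(dim=-1).values.mean()


def two_stage_retrieval(img_glb, img_tok, txt_glb, txt_tok, query_idx=0, top_k=20):
    """Retrieve images for one text query: coarse ranking prunes to top_k
    candidates, then a mixed global+local score re-ranks them."""
    coarse = global_similarity(img_glb, txt_glb)[:, query_idx]            # (N_img,)
    candidates = coarse.topk(min(top_k, coarse.numel())).indices          # stage 1: prune
    fine = torch.stack([local_similarity(img_tok[i], txt_tok[query_idx])
                        for i in candidates])                             # stage 2: fine scores
    mixed = coarse[candidates] + fine                                     # mixed global + local
    return candidates[mixed.argsort(descending=True)]
```

Because the fine-grained pass only touches the top-k candidates, the expensive token-level matching cost is bounded regardless of gallery size, which is consistent with the low inference time the abstract reports.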