Li Jiangtong, Liu Liu, Niu Li, Zhang Liqing
IEEE Trans Image Process. 2021;30:9193-9207. doi: 10.1109/TIP.2021.3123553. Epub 2021 Nov 10.
Image-text retrieval aims to capture the semantic correlation between images and texts. Existing image-text retrieval methods can be roughly categorized into the embedding learning paradigm and the pair-wise learning paradigm. The former fails to capture the fine-grained correspondence between images and texts. The latter achieves fine-grained alignment between regions and words, but the high cost of pair-wise computation leads to slow retrieval speed. In this paper, we propose a novel method named Memory-based EMBedding Enhancement for image-text Retrieval (MEMBER), which introduces global memory banks to enable fine-grained alignment and fusion within the embedding learning paradigm. Specifically, we enrich image (resp., text) features with relevant text (resp., image) features stored in the text (resp., image) memory bank. In this way, our model not only accomplishes mutual embedding enhancement across the two modalities, but also maintains retrieval efficiency. Extensive experiments demonstrate that MEMBER remarkably outperforms state-of-the-art approaches on two large-scale benchmark datasets.
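The cross-modal enhancement step described in the abstract can be pictured as an attention-style read over a global memory bank of the other modality. The sketch below is a minimal illustration of that idea, not the authors' implementation: the top-k retrieval, softmax weighting, fixed residual fusion, and all names (e.g., enhance_with_memory) are assumptions, since the abstract does not specify MEMBER's exact fusion mechanism.

```python
# Minimal sketch of memory-based embedding enhancement (hypothetical;
# the real MEMBER fusion details are not given in the abstract).
import torch
import torch.nn.functional as F


def enhance_with_memory(query, memory_bank, k=8):
    """Enrich `query` embeddings (e.g., image features) with relevant
    entries from `memory_bank` (e.g., cached text features).

    query:       (B, D) batch of embeddings from one modality
    memory_bank: (M, D) global memory of the other modality
    """
    q = F.normalize(query, dim=-1)
    m = F.normalize(memory_bank, dim=-1)
    sim = q @ m.t()                                    # (B, M) cosine similarities
    topk_sim, topk_idx = sim.topk(k, dim=-1)           # keep the k most relevant entries
    weights = topk_sim.softmax(dim=-1)                 # (B, k) attention over retrieved entries
    retrieved = memory_bank[topk_idx]                  # (B, k, D) gathered memory entries
    read = (weights.unsqueeze(-1) * retrieved).sum(1)  # (B, D) memory read-out
    # Residual fusion keeps the original embedding dominant; a real model
    # would learn this combination rather than fix the weight at 0.5.
    return query + 0.5 * read


if __name__ == "__main__":
    img_emb = torch.randn(4, 256)          # batch of image embeddings
    text_memory = torch.randn(1000, 256)   # global text memory bank
    enhanced = enhance_with_memory(img_emb, text_memory)
    print(enhanced.shape)                  # torch.Size([4, 256])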