Li Kunpeng, Zhang Yulun, Li Kai, Li Yuanyuan, Fu Yun
IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):641-656. doi: 10.1109/TPAMI.2022.3148470. Epub 2022 Dec 5.
As a bridge between the language and vision domains, cross-modal retrieval between images and texts has been a hot research topic in recent years. It remains challenging because current image representations usually lack the semantic concepts present in the corresponding sentence captions. To address this issue, we introduce an intuitive and interpretable model that learns a common embedding space for aligning images and text descriptions. Specifically, our model first incorporates semantic relationship information into visual and textual features by performing region or word relationship reasoning. It then uses gate and memory mechanisms to perform global semantic reasoning on these relationship-enhanced features, selecting the discriminative information and gradually growing a representation of the whole scene. Through alignment learning, the learned visual representations capture the key objects and semantic concepts of a scene, as in the corresponding text caption. Experiments on the MS-COCO [1] and Flickr30K [2] datasets validate that our method surpasses many recent state-of-the-art methods by a clear margin. Beyond effectiveness, our methods are also very efficient at the inference stage. Thanks to effective overall representation learning with visual semantic reasoning, our methods achieve very strong performance while relying only on a simple inner product to obtain similarity scores between images and captions. Experiments validate that the proposed methods are more than 30-75 times faster than many recent methods with publicly available code. Instead of following the recent trend of using complex local matching strategies [3], [4], [5], [6] that pursue good performance at the cost of efficiency, we show that a simple global matching strategy can still be highly effective and efficient, and can even achieve better performance within our framework.
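The pipeline the abstract describes (relationship reasoning over region features, a gate-and-memory update that grows a global scene representation, then a plain inner product against the text embedding) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the relationship step is a simplified dot-product affinity rather than a trained graph network, and all weight matrices are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def relation_reasoning(regions):
    # Simplified stand-in for region relationship reasoning: pairwise
    # dot-product affinities, softmax-normalized, used to aggregate
    # each region's feature with its related regions.
    aff = regions @ regions.T
    aff = np.exp(aff - aff.max(axis=1, keepdims=True))
    aff /= aff.sum(axis=1, keepdims=True)
    return aff @ regions

def gated_global_reasoning(features, d):
    # GRU-style gate/memory update: step through the relationship-enhanced
    # features, selecting discriminative information and gradually growing
    # a representation of the whole scene. Weights here are placeholders.
    Wz, Uz = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
    Wh, Uh = rng.normal(0, 0.1, (d, d)), rng.normal(0, 0.1, (d, d))
    h = np.zeros(d)
    for x in features:
        z = 1.0 / (1.0 + np.exp(-(Wz @ x + Uz @ h)))  # update gate
        h_cand = np.tanh(Wh @ x + Uh @ h)             # candidate memory
        h = (1 - z) * h + z * h_cand                  # gated memory update
    return h

d = 8
regions = rng.normal(size=(5, d))   # hypothetical detected-region features
text = rng.normal(size=d)           # hypothetical sentence embedding

img = gated_global_reasoning(relation_reasoning(regions), d)
score = float(img @ text)           # simple inner-product similarity
```

Because matching reduces to a single inner product per image-caption pair, retrieval over a gallery is one matrix-vector product, which is the source of the large inference-time speedup over local matching strategies.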