Shi Zhangxiang, Zhang Tianzhu, Wei Xi, Wu Feng, Zhang Yongdong
IEEE Trans Image Process. 2024;33:1326-1337. doi: 10.1109/TIP.2022.3197972. Epub 2024 Feb 13.
The mainstream of image and sentence matching research currently focuses on fine-grained alignment between image regions and sentence words. However, these methods miss a crucial fact: the correspondence between images and sentences does not simply come from alignments between individual regions and words, but from alignments between the phrases they respectively form. In this work, we propose a novel Decoupled Cross-modal Phrase-Attention network (DCPA) for image-sentence matching that models the relationships between textual phrases and visual phrases. Furthermore, we design a novel decoupled scheme for training and inference that relieves the trade-off in bi-directional retrieval: image-to-sentence matching is executed in the textual semantic space, while sentence-to-image matching is executed in the visual semantic space. Extensive experimental results on Flickr30K and MS-COCO demonstrate that the proposed method outperforms state-of-the-art methods by a large margin and competes with some methods that introduce external knowledge.
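To make the decoupled inference scheme concrete, below is a minimal, hypothetical sketch of bi-directional matching under this idea. All names (image_phrases, text_phrases, cross_attend), the tensor shapes, and the use of plain scaled dot-product attention with mean pooling are illustrative assumptions; the paper's actual phrase extraction and attention design are not reproduced here.

# A minimal, hypothetical sketch of the decoupled bi-directional matching
# described above. Shapes, names, and the use of plain scaled dot-product
# attention with mean pooling are illustrative assumptions, not the
# authors' implementation.

import torch
import torch.nn.functional as F

def cross_attend(queries, keys_values):
    # Scaled dot-product attention: re-express the query phrases in the
    # semantic space of the other modality (keys_values).
    d = queries.size(-1)
    attn = torch.softmax(queries @ keys_values.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ keys_values

def similarity(a, b):
    # Cosine similarity between mean-pooled phrase sets.
    a = F.normalize(a.mean(dim=-2), dim=-1)
    b = F.normalize(b.mean(dim=-2), dim=-1)
    return (a * b).sum(-1)

# Toy phrase features: 5 visual phrases and 7 textual phrases, dim 256.
image_phrases = torch.randn(5, 256)
text_phrases = torch.randn(7, 256)

# Image-to-sentence retrieval: represent the image through attention over
# the sentence's textual phrases, so the comparison happens in the
# textual semantic space.
img_in_text_space = cross_attend(image_phrases, text_phrases)
i2t_score = similarity(img_in_text_space, text_phrases)

# Sentence-to-image retrieval: symmetrically, the comparison happens in
# the visual semantic space.
txt_in_image_space = cross_attend(text_phrases, image_phrases)
t2i_score = similarity(txt_in_image_space, image_phrases)

print(f"i2t: {i2t_score.item():.3f}  t2i: {t2i_score.item():.3f}")

The point of the decoupling is that each retrieval direction can score candidates in the space best suited to it, rather than forcing a single joint embedding that trades one direction off against the other.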