Xu Xing, Wang Tan, Yang Yang, Zuo Lin, Shen Fumin, Shen Heng Tao
IEEE Trans Neural Netw Learn Syst. 2020 Dec;31(12):5412-5425. doi: 10.1109/TNNLS.2020.2967597. Epub 2020 Nov 30.
The task of image-text matching refers to measuring the visual-semantic similarity between an image and a sentence. Recently, fine-grained matching methods that explore the local alignment between image regions and sentence words have shown advances in inferring the image-text correspondence by aggregating pairwise region-word similarities. However, the local alignment is hard to achieve, as some important image regions may be inaccurately detected or even missing. Meanwhile, some words with high-level semantics cannot strictly correspond to a single image region. To tackle these problems, we highlight the importance of exploiting the global semantic consistency between image regions and sentence words as a complement to the local alignment. In this article, we propose a novel hybrid matching approach named Cross-modal Attention with Semantic Consistency (CASC) for image-text matching. The proposed CASC is a joint framework that performs cross-modal attention for the local alignment and multilabel prediction for the global semantic consistency. It directly extracts semantic labels from the available sentence corpus without additional labor cost, and these labels further provide a global similarity constraint on the aggregated region-word similarity obtained by the local alignment. Extensive experiments on the Flickr30k and Microsoft COCO (MSCOCO) data sets demonstrate the effectiveness of the proposed CASC in preserving global semantic consistency along with the local alignment, and further show its superior image-text matching performance compared with more than 15 state-of-the-art methods.
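To make the two branches described in the abstract concrete, below is a minimal PyTorch sketch: a cross-modal attention step that aggregates pairwise region-word similarities into a local alignment score, and a multilabel prediction head whose label probabilities supply a global semantic-consistency signal. All tensor shapes, names, the softmax temperature, and the score-combination weight are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def local_alignment_score(regions, words):
    """Aggregate pairwise region-word similarity via cross-modal attention.

    regions: (R, D) image-region features; words: (T, D) word features.
    The attention scheme and temperature (10.0) are illustrative assumptions.
    """
    r = F.normalize(regions, dim=-1)                # unit-norm region features
    w = F.normalize(words, dim=-1)                  # unit-norm word features
    sim = r @ w.t()                                 # (R, T) cosine similarities
    attn = F.softmax(10.0 * sim, dim=0)             # attend over regions per word
    attended = attn.t() @ r                         # (T, D) region context per word
    word_scores = F.cosine_similarity(attended, w, dim=-1)  # (T,) per-word match
    return word_scores.mean()                       # aggregated local similarity

class MultiLabelHead(torch.nn.Module):
    """Predict semantic labels mined from the sentence corpus (global branch)."""
    def __init__(self, dim, num_labels):
        super().__init__()
        self.fc = torch.nn.Linear(dim, num_labels)

    def forward(self, feat):
        return torch.sigmoid(self.fc(feat))         # per-label probabilities

# Hypothetical usage: combine local and global scores (0.5 weight is assumed).
regions, words = torch.randn(36, 256), torch.randn(12, 256)
img_feat, txt_feat = regions.mean(0), words.mean(0)   # pooled global features
head = MultiLabelHead(dim=256, num_labels=1000)
global_score = F.cosine_similarity(head(img_feat), head(txt_feat), dim=0)
score = local_alignment_score(regions, words) + 0.5 * global_score
```

In this sketch, the global branch constrains matching through agreement between the label probability vectors of the image and the sentence, standing in for the global similarity constraint described in the abstract.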