Jing Ya, Wang Wei, Wang Liang, Tan Tieniu
IEEE Trans Image Process. 2021;30:1840-1852. doi: 10.1109/TIP.2020.3048627. Epub 2021 Jan 18.
Image-text matching, which aims to measure the similarity between images and textual descriptions, has made great progress recently. The key to this cross-modal matching task is building latent semantic alignments between visual objects and words. Because sentence structures vary widely, it is difficult to learn such alignments using only global cross-modal features. Many previous methods attempt to learn aligned image-text representations with attention mechanisms, but they generally ignore the relationships within a textual description that determine whether words refer to the same visual object. In this paper, we propose a graph attentive relational network (GARN) that learns aligned image-text representations for identity-aware image-text matching by modeling the relationships between noun phrases in a text. In the GARN, we first decompose images and texts into regions and noun phrases, respectively. A skip graph neural network (skip-GNN) then learns effective textual representations that mix textual features with relational features. Finally, a graph attention network models the relationships between noun phrases to obtain the probability that each noun phrase belongs to each image region. We perform extensive experiments on the CUHK Person Description (CUHK-PEDES), Caltech-UCSD Birds (CUB), Oxford-102 Flowers, and Flickr30K datasets to verify the effectiveness of each component of our model. Experimental results show that our approach achieves state-of-the-art results on all four benchmark datasets.
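The abstract describes the final step only at a high level: a graph attention network produces probabilities that noun phrases belong to image regions. As a rough illustration of that idea, and not the authors' implementation, the following PyTorch sketch scores assumed phrase features against assumed region features in a shared space and normalizes the scores into per-phrase assignment probabilities. All layer names, dimensions, and the cosine-style scoring here are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhraseRegionAttention(nn.Module):
    """Illustrative sketch (not the paper's GARN): score each noun-phrase
    feature against each image-region feature, then normalize over regions
    to get the probability that a phrase belongs to a region."""

    def __init__(self, phrase_dim: int, region_dim: int, hidden_dim: int):
        super().__init__()
        # Project both modalities into a shared space (assumed design choice).
        self.phrase_proj = nn.Linear(phrase_dim, hidden_dim)
        self.region_proj = nn.Linear(region_dim, hidden_dim)

    def forward(self, phrases: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # phrases: (num_phrases, phrase_dim); regions: (num_regions, region_dim)
        p = F.normalize(self.phrase_proj(phrases), dim=-1)
        r = F.normalize(self.region_proj(regions), dim=-1)
        scores = p @ r.t()  # cosine-style cross-modal affinities
        # Softmax over regions: each row is a distribution over image regions
        # for one noun phrase, i.e., the assignment probabilities the
        # abstract refers to.
        return F.softmax(scores, dim=-1)

# Toy usage: random tensors stand in for real phrase/region encodings.
attn = PhraseRegionAttention(phrase_dim=300, region_dim=2048, hidden_dim=256)
probs = attn(torch.randn(5, 300), torch.randn(36, 2048))
print(probs.shape)  # torch.Size([5, 36]); each row sums to 1
```

Note that this sketch omits the paper's relational modeling between noun phrases (the skip-GNN and the graph structure over phrases); it only illustrates the cross-modal scoring and normalization step.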