

Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network

Authors

Peng Yuxin, Qi Jinwei, Yuan Yuxin

Publication

IEEE Trans Image Process. 2018 Jul 2. doi: 10.1109/TIP.2018.2852503.

DOI: 10.1109/TIP.2018.2852503
PMID: 29994397
Abstract

Nowadays, cross-modal retrieval plays an important role in flexibly finding useful information across different modalities of data. Effectively measuring the similarity between different modalities of data is the key to cross-modal retrieval. Different modalities such as image and text have an imbalanced and complementary relationship, and they contain unequal amounts of information when describing the same semantics. For example, images often contain more details that cannot be demonstrated by textual descriptions, and vice versa. Existing works based on Deep Neural Networks (DNN) mostly construct one common space for different modalities to find the latent alignments between them, which loses their exclusive modality-specific characteristics. Therefore, we propose a modality-specific cross-modal similarity measurement (MCSM) approach that constructs an independent semantic space for each modality and adopts an end-to-end framework to directly generate modality-specific cross-modal similarity without explicit common representation. For each semantic space, modality-specific characteristics within one modality are fully exploited by a recurrent attention network, while the data of the other modality is projected into this space with attention-based joint embedding, which utilizes the learned attention weights to guide fine-grained cross-modal correlation learning and captures the imbalanced and complementary relationship between different modalities. Finally, the complementarity between the semantic spaces for different modalities is explored by adaptive fusion of the modality-specific cross-modal similarities to perform cross-modal retrieval. Experiments on the widely used Wikipedia, Pascal Sentence, and MS-COCO datasets, as well as our constructed large-scale XMediaNet dataset, verify the effectiveness of our proposed approach, which outperforms 9 state-of-the-art methods.
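The scoring scheme the abstract describes can be sketched in miniature: within one modality's semantic space, attention weights over fine-grained parts (image regions or words) reweight similarities to the other modality's projected representation, and the two modality-specific scores are then adaptively fused. This is a minimal illustrative sketch only; the attention scores, projections, and fusion weight here are assumed given, whereas the paper learns them end to end with a recurrent attention network and joint embedding, and all function names are hypothetical.

```python
import math

def softmax(scores):
    # Normalize raw attention scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-12)

def modality_specific_similarity(anchor_parts, attention_scores, projected_other):
    # Attention-weighted similarity between the other modality's vector
    # (already projected into this modality's space) and each fine-grained
    # part (region or word) of the anchor modality.
    weights = softmax(attention_scores)
    sims = [cosine(part, projected_other) for part in anchor_parts]
    return sum(w * s for w, s in zip(weights, sims))

def adaptive_fusion(sim_in_image_space, sim_in_text_space, alpha):
    # Convex combination of the two modality-specific similarities;
    # the paper learns the fusion adaptively, here alpha is fixed.
    return alpha * sim_in_image_space + (1 - alpha) * sim_in_text_space

# Toy usage: two image regions, attention favoring the second region,
# and a text vector projected into the image's semantic space.
image_regions = [[1.0, 0.0], [0.0, 1.0]]
region_attention = [0.2, 0.8]
projected_text = [0.6, 0.8]
s_img = modality_specific_similarity(image_regions, region_attention, projected_text)
```

The fused score `adaptive_fusion(s_img, s_txt, alpha)` would then rank candidates for retrieval; because each modality keeps its own space, the image-side score can exploit region-level detail that a single common space would average away.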


Similar Articles

1. Modality-specific Cross-modal Similarity Measurement with Recurrent Attention Network.
   IEEE Trans Image Process. 2018 Jul 2. doi: 10.1109/TIP.2018.2852503.
2. Hybrid DAER Based Cross-Modal Retrieval Exploiting Deep Representation Learning.
   Entropy (Basel). 2023 Aug 16;25(8):1216. doi: 10.3390/e25081216.
3. Deep Relation Embedding for Cross-Modal Retrieval.
   IEEE Trans Image Process. 2021;30:617-627. doi: 10.1109/TIP.2020.3038354. Epub 2020 Dec 1.
4. HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval.
   Sensors (Basel). 2023 Feb 25;23(5):2559. doi: 10.3390/s23052559.
5. Structure-aware contrastive hashing for unsupervised cross-modal retrieval.
   Neural Netw. 2024 Jun;174:106211. doi: 10.1016/j.neunet.2024.106211. Epub 2024 Feb 27.
6. Bridging multimedia heterogeneity gap via Graph Representation Learning for cross-modal retrieval.
   Neural Netw. 2021 Feb;134:143-162. doi: 10.1016/j.neunet.2020.11.011. Epub 2020 Nov 28.
7. Unsupervised Visual-Textual Correlation Learning With Fine-Grained Semantic Alignment.
   IEEE Trans Cybern. 2022 May;52(5):3669-3683. doi: 10.1109/TCYB.2020.3015084. Epub 2022 May 19.
8. Object-Level Visual-Text Correlation Graph Hashing for Unsupervised Cross-Modal Retrieval.
   Sensors (Basel). 2022 Apr 11;22(8):2921. doi: 10.3390/s22082921.
9. MHTN: Modal-Adversarial Hybrid Transfer Network for Cross-Modal Retrieval.
   IEEE Trans Cybern. 2020 Mar;50(3):1047-1059. doi: 10.1109/TCYB.2018.2879846. Epub 2018 Dec 5.
10. Online Asymmetric Metric Learning With Multi-Layer Similarity Aggregation for Cross-Modal Retrieval.
    IEEE Trans Image Process. 2019 Sep;28(9):4299-4312. doi: 10.1109/TIP.2019.2908774. Epub 2019 Apr 2.

Cited By

1. Hybrid DAER Based Cross-Modal Retrieval Exploiting Deep Representation Learning.
   Entropy (Basel). 2023 Aug 16;25(8):1216. doi: 10.3390/e25081216.
2. Bilinear pooling in video-QA: empirical challenges and motivational drift from neurological parallels.
   PeerJ Comput Sci. 2022 Jun 3;8:e974. doi: 10.7717/peerj-cs.974. eCollection 2022.