

Decoupled Cross-Modal Phrase-Attention Network for Image-Sentence Matching.

Authors

Shi Zhangxiang, Zhang Tianzhu, Wei Xi, Wu Feng, Zhang Yongdong

Publication

IEEE Trans Image Process. 2024;33:1326-1337. doi: 10.1109/TIP.2022.3197972. Epub 2024 Feb 13.

DOI: 10.1109/TIP.2022.3197972
PMID: 35976823
Abstract

The mainstream of image and sentence matching studies currently focuses on fine-grained alignment of image regions and sentence words. However, these methods miss a crucial fact: the correspondence between images and sentences does not simply come from alignments between individual regions and words but from alignments between the phrases they form respectively. In this work, we propose a novel Decoupled Cross-modal Phrase-Attention network (DCPA) for image-sentence matching by modeling the relationships between textual phrases and visual phrases. Furthermore, we design a novel decoupled manner for training and inferencing, which is able to release the trade-off for bi-directional retrieval, where image-to-sentence matching is executed in textual semantic space and sentence-to-image matching is executed in visual semantic space. Extensive experimental results on Flickr30K and MS-COCO demonstrate that the proposed method outperforms state-of-the-art methods by a large margin, and can compete with some methods introducing external knowledge.
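The decoupled bidirectional scoring the abstract describes — image-to-sentence matching executed in one modality's semantic space and sentence-to-image in the other's — can be illustrated with a minimal sketch. This is a hypothetical NumPy toy, not the paper's DCPA implementation: `attend`, `match_score`, the softmax temperature, and the random features are all assumptions for illustration; the key point it demonstrates is that the two retrieval directions are scored independently, so neither direction has to compromise for the other.

```python
import numpy as np

def l2norm(x, axis=-1):
    """Normalize vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def attend(queries, contexts, temperature=0.1):
    """Attention-pool `contexts` for each query: softmax over cosine
    similarities (a simplified stand-in for phrase-level attention)."""
    logits = l2norm(queries) @ l2norm(contexts).T / temperature  # (Nq, Nc)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ contexts  # (Nq, d): one attended context per query

def match_score(queries, contexts):
    """Mean cosine similarity between queries and their attended contexts,
    computed entirely in the query modality's semantic space."""
    sims = (l2norm(queries) * l2norm(attend(queries, contexts))).sum(axis=1)
    return float(sims.mean())

rng = np.random.default_rng(0)
regions = rng.normal(size=(8, 16))  # toy visual (region/phrase) features
words   = rng.normal(size=(5, 16))  # toy textual (word/phrase) features

# Decoupled directions: each retrieval direction gets its own score,
# anchored in its own semantic space, rather than one shared similarity.
i2s = match_score(words, regions)   # image-to-sentence direction
s2i = match_score(regions, words)   # sentence-to-image direction
```

Because each direction uses its own attention pass and its own anchor space, `i2s` and `s2i` are generally different numbers — which is the asymmetry a single coupled similarity score cannot express.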


Similar articles

1. Cross-Modal Attention With Semantic Consistence for Image-Text Matching.
   IEEE Trans Neural Netw Learn Syst. 2020 Dec;31(12):5412-5425. doi: 10.1109/TNNLS.2020.2967597. Epub 2020 Nov 30.
2. MAVA: Multi-level Adaptive Visual-textual Alignment by Cross-media Bi-attention Mechanism.
   IEEE Trans Image Process. 2019 Nov 22. doi: 10.1109/TIP.2019.2952085.
3. Latent Space Semantic Supervision Based on Knowledge Distillation for Cross-Modal Retrieval.
   IEEE Trans Image Process. 2022;31:7154-7164. doi: 10.1109/TIP.2022.3220051. Epub 2022 Nov 16.
4. Learning Aligned Image-Text Representations Using Graph Attentive Relational Network.
   IEEE Trans Image Process. 2021;30:1840-1852. doi: 10.1109/TIP.2020.3048627. Epub 2021 Jan 18.
5. Unsupervised Visual-Textual Correlation Learning With Fine-Grained Semantic Alignment.
   IEEE Trans Cybern. 2022 May;52(5):3669-3683. doi: 10.1109/TCYB.2020.3015084. Epub 2022 May 19.
6. Learning Two-Branch Neural Networks for Image-Text Matching Tasks.
   IEEE Trans Pattern Anal Mach Intell. 2019 Feb;41(2):394-407. doi: 10.1109/TPAMI.2018.2797921. Epub 2018 Jan 24.
7. Few-Shot Image and Sentence Matching via Aligned Cross-Modal Memory.
   IEEE Trans Pattern Anal Mach Intell. 2022 Jun;44(6):2968-2983. doi: 10.1109/TPAMI.2021.3052490. Epub 2022 May 5.
8. BCAN: Bidirectional Correct Attention Network for Cross-Modal Retrieval.
   IEEE Trans Neural Netw Learn Syst. 2024 Oct;35(10):14247-14258. doi: 10.1109/TNNLS.2023.3276796. Epub 2024 Oct 7.
9. Deep Relation Embedding for Cross-Modal Retrieval.
   IEEE Trans Image Process. 2021;30:617-627. doi: 10.1109/TIP.2020.3038354. Epub 2020 Dec 1.