
Adaptive Latent Graph Representation Learning for Image-Text Matching.

Authors

Tian Mengxiao, Wu Xinxiao, Jia Yunde

Publication

IEEE Trans Image Process. 2023;32:471-482. doi: 10.1109/TIP.2022.3229631. Epub 2022 Dec 30.

Abstract

Image-text matching is a challenging task due to the modality gap. Many recent methods focus on modeling entity relationships to learn a common embedding space for images and texts. However, these methods suffer from distracting entity relationships, such as irrelevant visual regions in an image and noisy words in a text. In this paper, we propose an adaptive latent graph representation learning method that reduces these distractions for image-text matching. Specifically, we use an improved graph variational autoencoder to separate the distracting factors from the latent factors of relationships, and jointly learn latent textual graph representations, latent visual graph representations, and a visual-textual graph embedding space. We also introduce an adaptive cross-attention mechanism that attends over the latent graph representations across images and texts, further narrowing the modality gap and boosting matching performance. Extensive experiments on two public datasets, Flickr30K and COCO, demonstrate the effectiveness of our method.
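The adaptive cross-attention mechanism is only described at a high level in the abstract; its exact form is defined in the paper. As a rough illustration of the generic building block it extends, the following is a minimal pure-Python sketch of scaled dot-product cross-attention, where node embeddings from one modality (e.g. latent textual graph nodes) attend over those of the other (latent visual graph nodes). All names and shapes here are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(queries, keys, values):
    """Scaled dot-product cross-attention (illustrative sketch).

    Each query vector (one modality) attends over key/value vectors
    (the other modality) and returns a weighted sum of the values.
    """
    d = len(keys[0])
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted combination of the value vectors.
        attended.append([sum(w * v[j] for w, v in zip(weights, values))
                         for j in range(len(values[0]))])
    return attended
```

With a single key, the output is exactly that key's value (the softmax weight is 1); with several keys, the query's output is pulled toward the values of the most similar keys, which is the sense in which attention lets one modality "focus" on relevant parts of the other.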

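The improved graph variational autoencoder that separates distracting from latent factors is specific to the paper and not reproduced here. The core sampling step shared by any variational autoencoder, however, is the reparameterization trick, sketched below under the standard assumption of a diagonal Gaussian posterior; the function name and interface are illustrative.

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1).

    The standard VAE reparameterization trick: sampling is rewritten
    as a deterministic function of (mu, log_var) plus external noise,
    so gradients can flow through the posterior parameters.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```

As the predicted variance shrinks toward zero, the sampled latent collapses to the mean, which is a quick sanity check for any implementation of this step.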
