

Image-Text Embedding Learning via Visual and Textual Semantic Reasoning.

Authors

Li Kunpeng, Zhang Yulun, Li Kai, Li Yuanyuan, Fu Yun

Publication Info

IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):641-656. doi: 10.1109/TPAMI.2022.3148470. Epub 2022 Dec 5.

DOI: 10.1109/TPAMI.2022.3148470
PMID: 35130144
Abstract

As a bridge between language and vision domains, cross-modal retrieval between images and texts has been a hot research topic in recent years. It remains challenging because current image representations usually lack the semantic concepts present in the corresponding sentence captions. To address this issue, we introduce an intuitive and interpretable model to learn a common embedding space for alignments between images and text descriptions. Specifically, our model first incorporates semantic relationship information into visual and textual features by performing region or word relationship reasoning. It then utilizes gate and memory mechanisms to perform global semantic reasoning on these relationship-enhanced features, select the discriminative information, and gradually grow representations for the whole scene. Through alignment learning, the learned visual representations capture key objects and semantic concepts of a scene, as in the corresponding text caption. Experiments on the MS-COCO [1] and Flickr30K [2] datasets validate that our method surpasses many recent state-of-the-art methods by a clear margin. Beyond effectiveness, our methods are also very efficient at the inference stage. Thanks to effective overall representation learning with visual semantic reasoning, our methods achieve very strong performance while relying only on a simple inner product to obtain similarity scores between images and captions. Experiments validate that the proposed methods are more than 30-75 times faster than many recent methods with publicly available code. Instead of following the recent trend of using complex local matching strategies [3], [4], [5], [6] to pursue good performance at the cost of efficiency, we show that a simple global matching strategy can still be very effective and efficient, and can achieve even better performance within our framework.

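The abstract's efficiency claim rests on a simple idea: once images and captions are embedded into a common space, retrieval reduces to an inner product between L2-normalized global embeddings, with no per-pair local matching. A minimal sketch of that matching step, using random stand-in vectors rather than the paper's learned features:

```python
import numpy as np

def similarity_matrix(img_emb, txt_emb):
    """Cosine similarity via inner product of L2-normalized embeddings.

    img_emb: (N, D) image embeddings; txt_emb: (M, D) caption embeddings.
    Returns an (N, M) score matrix: ranking a row gives text retrieval
    for that image, ranking a column gives image retrieval for that caption.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return img @ txt.T  # one matrix multiply scores every image-caption pair

# Toy example: 2 images, 3 captions, 4-dim embeddings (stand-ins only).
rng = np.random.default_rng(0)
images = rng.normal(size=(2, 4))
captions = rng.normal(size=(3, 4))
scores = similarity_matrix(images, captions)
best_caption_per_image = scores.argmax(axis=1)
```

Because all pairwise scores come from a single matrix multiplication over precomputable embeddings, inference cost scales far better than cross-attention-style local matching, which must rerun a network for every image-caption pair; this is the source of the 30-75x speedup the authors report.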

Similar Articles

1. Image-Text Embedding Learning via Visual and Textual Semantic Reasoning.
   IEEE Trans Pattern Anal Mach Intell. 2023 Jan;45(1):641-656. doi: 10.1109/TPAMI.2022.3148470. Epub 2022 Dec 5.
2. Topic-Oriented Image Captioning Based on Order-Embedding.
   IEEE Trans Image Process. 2019 Jun;28(6):2743-2754. doi: 10.1109/TIP.2018.2889922. Epub 2018 Dec 27.
3. Learning Relationship-Enhanced Semantic Graph for Fine-Grained Image-Text Matching.
   IEEE Trans Cybern. 2024 Feb;54(2):948-961. doi: 10.1109/TCYB.2022.3179020. Epub 2024 Jan 17.
4. Adaptive Latent Graph Representation Learning for Image-Text Matching.
   IEEE Trans Image Process. 2023;32:471-482. doi: 10.1109/TIP.2022.3229631. Epub 2022 Dec 30.
5. Learning Aligned Image-Text Representations Using Graph Attentive Relational Network.
   IEEE Trans Image Process. 2021;30:1840-1852. doi: 10.1109/TIP.2020.3048627. Epub 2021 Jan 18.
6. Cross-Modal Attention With Semantic Consistence for Image-Text Matching.
   IEEE Trans Neural Netw Learn Syst. 2020 Dec;31(12):5412-5425. doi: 10.1109/TNNLS.2020.2967597. Epub 2020 Nov 30.
7. Visual context learning based on textual knowledge for image-text retrieval.
   Neural Netw. 2022 Aug;152:434-449. doi: 10.1016/j.neunet.2022.05.008. Epub 2022 May 18.
8. Unsupervised Visual-Textual Correlation Learning With Fine-Grained Semantic Alignment.
   IEEE Trans Cybern. 2022 May;52(5):3669-3683. doi: 10.1109/TCYB.2020.3015084. Epub 2022 May 19.
9. Context-Fused Guidance for Image Captioning Using Sequence-Level Training.
   Comput Intell Neurosci. 2022 Jan 5;2022:9743123. doi: 10.1155/2022/9743123. eCollection 2022.
10. On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval.
   J Imaging. 2021 Jul 26;7(8):125. doi: 10.3390/jimaging7080125.

Cited By

1. Novel cross-dimensional coarse-fine-grained complementary network for image-text matching.
   PeerJ Comput Sci. 2025 Mar 3;11:e2725. doi: 10.7717/peerj-cs.2725. eCollection 2025.
2. HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval.
   Sensors (Basel). 2023 Feb 25;23(5):2559. doi: 10.3390/s23052559.