
Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training.

Publication information

IEEE Trans Image Process. 2023;32:3622-3633. doi: 10.1109/TIP.2023.3286710. Epub 2023 Jul 3.

DOI: 10.1109/TIP.2023.3286710
PMID: 37339023
Abstract

Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various visual and language tasks. Most previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations for each modality are important for image-text retrieval but almost neglected. As a result, such previous works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans simultaneously pay attention to the entire sample and regional elements to understand the semantic content. To this end, a Token-Guided Dual Transformer (TGDT) architecture which consists of two homogeneous branches for image and text modalities, respectively, is proposed for image-text retrieval. The TGDT incorporates both coarse- and fine-grained retrievals into a unified framework and beneficially leverages the advantages of both retrieval approaches. A novel training objective called Consistent Multimodal Contrastive (CMC) loss is proposed accordingly to ensure the intra- and inter-modal semantic consistencies between images and texts in the common embedding space. Equipped with a two-stage inference method based on the mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performances with extremely low inference time when compared with representative recent approaches. Code is publicly available: github.com/LCFractal/TGDT.
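The two-stage inference the abstract describes — a fast coarse pass that ranks the whole gallery by global-embedding similarity, then a re-ranking of only the shortlist using a mixed global and local (token-level) score — can be sketched as below. This is an illustrative reconstruction, not the authors' released implementation (their code is at github.com/LCFractal/TGDT); the embedding shapes, the max-mean token-matching rule, and the `top_k` shortlist size are assumptions made for the sketch.

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two 2-D embedding matrices."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def token_similarity(img_tokens, txt_tokens):
    """Fine-grained (local) score: each text token is matched to its
    best-aligned image token, and the matches are averaged."""
    sim = cosine(txt_tokens, img_tokens)      # (n_txt_tokens, n_img_tokens)
    return sim.max(axis=1).mean()

def two_stage_retrieval(query_global, query_tokens,
                        gallery_global, gallery_tokens, top_k=10):
    """Stage 1: rank the whole gallery by coarse global similarity.
    Stage 2: re-rank the top_k candidates by a mixed global + local score,
    so the expensive token matching runs on the shortlist only."""
    coarse = cosine(query_global[None, :], gallery_global)[0]   # (N,)
    candidates = np.argsort(-coarse)[:top_k]
    mixed = [coarse[i] + token_similarity(gallery_tokens[i], query_tokens)
             for i in candidates]
    return candidates[np.argsort(-np.array(mixed))]
```

The coarse pass is a single matrix product over global vectors, while the quadratic token-matching cost is paid for only `top_k` items — which is how a scheme like this keeps inference time low without giving up fine-grained accuracy on the final ranking.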


Similar articles

1
Relation-Aggregated Cross-Graph Correlation Learning for Fine-Grained Image-Text Retrieval.
IEEE Trans Neural Netw Learn Syst. 2024 Feb;35(2):2194-2207. doi: 10.1109/TNNLS.2022.3188569. Epub 2024 Feb 5.
2
Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective.
Sensors (Basel). 2024 May 14;24(10):3130. doi: 10.3390/s24103130.
3
Latent Space Semantic Supervision Based on Knowledge Distillation for Cross-Modal Retrieval.
IEEE Trans Image Process. 2022;31:7154-7164. doi: 10.1109/TIP.2022.3220051. Epub 2022 Nov 16.
4
Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals.
IEEE Trans Neural Netw Learn Syst. 2023 Apr;34(4):1838-1851. doi: 10.1109/TNNLS.2020.2997020. Epub 2023 Apr 4.
5
Memorize, Associate and Match: Embedding Enhancement via Fine-Grained Alignment for Image-Text Retrieval.
IEEE Trans Image Process. 2021;30:9193-9207. doi: 10.1109/TIP.2021.3123553. Epub 2021 Nov 10.
6
CLIP-Driven Fine-Grained Text-Image Person Re-Identification.
IEEE Trans Image Process. 2023;32:6032-6046. doi: 10.1109/TIP.2023.3327924. Epub 2023 Nov 7.
7
Universal Multimodal Representation for Language Understanding.
IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):9169-9185. doi: 10.1109/TPAMI.2023.3234170. Epub 2023 Jun 5.
8
USER: Unified Semantic Enhancement With Momentum Contrast for Image-Text Retrieval.
IEEE Trans Image Process. 2024;33:595-609. doi: 10.1109/TIP.2023.3348297. Epub 2024 Jan 10.
9
Histopathology language-image representation learning for fine-grained digital pathology cross-modal retrieval.
Med Image Anal. 2024 Jul;95:103163. doi: 10.1016/j.media.2024.103163. Epub 2024 Apr 9.