
End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval.

Authors

Shen Wenxue, Song Jingkuan, Zhu Xiaosu, Li Gongfu, Shen Heng Tao

Publication

IEEE Trans Image Process. 2023;32:5017-5030. doi: 10.1109/TIP.2023.3275071. Epub 2023 Sep 8.

DOI: 10.1109/TIP.2023.3275071
PMID: 37186535
Abstract

Lately, video-language pre-training and text-video retrieval have attracted significant attention with the explosion of multimedia data on the Internet. However, existing approaches for video-language pre-training typically limit the exploitation of the hierarchical semantic information in videos, such as frame semantic information and global video semantic information. In this work, we present an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast named HMMC. The key idea is to explore the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (which can be a title, tag, or caption), the frames in this video usually have semantic connections with the text and show higher similarity than frames in other videos. Hierarchical matching is mainly realized by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). Another proxy task, Frame Adjacency Matching (FAM), is proposed to enhance the single visual modality representations while training from scratch. Furthermore, the momentum contrast framework was introduced into HMMC to form a multimodal momentum contrast framework, enabling HMMC to incorporate more negative samples for contrastive learning, which contributes to the generalization of representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos) named CHVTT to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our methods. We release our code at https://github.com/cheetah003/HMMC.
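The momentum contrast mechanism the abstract refers to follows the MoCo recipe: a key encoder is updated as an exponential moving average of the query encoder, and encoded keys are kept in a large queue so that each query can be contrasted against many negatives without enlarging the batch. A minimal NumPy sketch of that mechanism, with illustrative function and parameter names that are not taken from the paper's released code:

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """MoCo-style update: each key-encoder parameter becomes an
    exponential moving average of the query encoder's parameter,
    keeping queued keys consistent over training steps."""
    return {name: m * k + (1.0 - m) * query_params[name]
            for name, k in key_params.items()}

def contrastive_logits(query, pos_key, neg_queue, tau=0.07):
    """InfoNCE logits for one query against its positive key and a
    queue of K negative keys; all vectors assumed L2-normalized.
    Index 0 is the positive, so the contrastive label is 0."""
    l_pos = np.dot(query, pos_key)   # similarity to the positive key
    l_neg = neg_queue @ query        # (K,) similarities to queued negatives
    return np.concatenate(([l_pos], l_neg)) / tau
```

The queue is the point of the design: its size, not the batch size, sets how many negatives each contrastive step sees, which is what lets HMMC "incorporate more negative samples for contrastive learning."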


Similar Articles

1. End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval.
   IEEE Trans Image Process. 2023;32:5017-5030. doi: 10.1109/TIP.2023.3275071. Epub 2023 Sep 8.
2. Query-Adaptive Late Fusion for Hierarchical Fine-Grained Video-Text Retrieval.
   IEEE Trans Neural Netw Learn Syst. 2022 Oct 24;PP. doi: 10.1109/TNNLS.2022.3214208.
3. USER: Unified Semantic Enhancement With Momentum Contrast for Image-Text Retrieval.
   IEEE Trans Image Process. 2024;33:595-609. doi: 10.1109/TIP.2023.3348297. Epub 2024 Jan 10.
4. Semantics-Aware Spatial-Temporal Binaries for Cross-Modal Video Retrieval.
   IEEE Trans Image Process. 2021;30:2989-3004. doi: 10.1109/TIP.2020.3048680. Epub 2021 Feb 18.
5. Efficient Token-Guided Image-Text Retrieval With Consistent Multimodal Contrastive Training.
   IEEE Trans Image Process. 2023;32:3622-3633. doi: 10.1109/TIP.2023.3286710. Epub 2023 Jul 3.
6. A new design of multimedia big data retrieval enabled by deep feature learning and Adaptive Semantic Similarity Function.
   Multimed Syst. 2022;28(3):1039-1058. doi: 10.1007/s00530-022-00897-8. Epub 2022 Feb 5.
7. A cross-modal conditional mechanism based on attention for text-video retrieval.
   Math Biosci Eng. 2023 Nov 3;20(11):20073-20092. doi: 10.3934/mbe.2023889.
8. MemBridge: Video-Language Pre-Training With Memory-Augmented Inter-Modality Bridge.
   IEEE Trans Image Process. 2023;32:4073-4087. doi: 10.1109/TIP.2023.3283916. Epub 2023 Jul 19.
9. Concept-Aware Video Captioning: Describing Videos With Effective Prior Information.
   IEEE Trans Image Process. 2023;32:5366-5378. doi: 10.1109/TIP.2023.3307969. Epub 2023 Oct 2.
10. Deep Semantic Multimodal Hashing Network for Scalable Image-Text and Video-Text Retrievals.
   IEEE Trans Neural Netw Learn Syst. 2023 Apr;34(4):1838-1851. doi: 10.1109/TNNLS.2020.2997020. Epub 2023 Apr 4.