Shen Wenxue, Song Jingkuan, Zhu Xiaosu, Li Gongfu, Shen Heng Tao
IEEE Trans Image Process. 2023;32:5017-5030. doi: 10.1109/TIP.2023.3275071. Epub 2023 Sep 8.
Recently, video-language pre-training and text-video retrieval have attracted significant attention with the explosion of multimedia data on the Internet. However, existing video-language pre-training approaches typically underexploit the hierarchical semantic information in videos, such as frame-level and global video-level semantics. In this work, we present an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast, named HMMC. The key idea is to exploit the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (a title, tag, or caption), the frames in that video usually have semantic connections with the text and show higher similarity to it than frames from other videos. Hierarchical matching is realized mainly by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). A third proxy task, Frame Adjacency Matching (FAM), is proposed to strengthen the single-modality visual representations when training from scratch. Furthermore, we introduce a momentum contrast framework into HMMC to form a multimodal momentum contrast framework, which enables HMMC to incorporate more negative samples for contrastive learning and thereby improves the generalization of the learned representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos), named CHVTT, to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our method. We release our code at https://github.com/cheetah003/HMMC.
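The multimodal momentum contrast described above follows the general MoCo recipe: a slowly updated momentum encoder produces keys, a fixed-size queue of past keys supplies extra negatives beyond the current batch, and matching is scored with an InfoNCE loss. The abstract does not give HMMC's exact formulation, so the sketch below is a generic, paper-agnostic illustration of those three pieces; the function names, queue size, and temperature are assumptions, not the authors' code.

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query embedding against a positive key
    and a bank of negative keys (e.g., a video/frame embedding vs.
    text embeddings, or vice versa)."""
    # Normalize so dot products become cosine similarities.
    q = query / np.linalg.norm(query)
    k_pos = positive / np.linalg.norm(positive)
    k_neg = negatives / np.linalg.norm(negatives, axis=1, keepdims=True)
    # Positive logit first, then one logit per queued negative.
    logits = np.concatenate([[q @ k_pos], k_neg @ q]) / temperature
    logits -= logits.max()  # numerical stability
    # Cross-entropy with the positive at index 0.
    return -logits[0] + np.log(np.exp(logits).sum())

class MomentumQueue:
    """FIFO queue of momentum-encoder keys: negatives accumulated
    across batches, so the contrast is not limited to batch size."""
    def __init__(self, size, dim, seed=0):
        self.keys = np.random.default_rng(seed).standard_normal((size, dim))
        self.ptr = 0
    def enqueue(self, new_keys):
        # Overwrite the oldest slots with the newest keys.
        idx = (self.ptr + np.arange(len(new_keys))) % len(self.keys)
        self.keys[idx] = new_keys
        self.ptr = (self.ptr + len(new_keys)) % len(self.keys)

def momentum_update(theta_query, theta_key, m=0.999):
    """EMA update of the key (momentum) encoder parameters, keeping
    queued keys consistent with a slowly drifting encoder."""
    return m * theta_key + (1 - m) * theta_query
```

In a hierarchical setup such as the one the abstract outlines, the same loss can be applied at both levels: once between the pooled video embedding and the text (VTM-style) and once between individual frame embeddings and the text (FTM-style), each drawing negatives from its own queue.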