MemBridge: Video-Language Pre-Training With Memory-Augmented Inter-Modality Bridge.

Authors

Yang Jiahao, Li Xiangyang, Zheng Mao, Wang Zihan, Zhu Yongqing, Guo Xiaoqian, Yuan Yuchen, Chai Zifeng, Jiang Shuqiang

Publication information

IEEE Trans Image Process. 2023;32:4073-4087. doi: 10.1109/TIP.2023.3283916. Epub 2023 Jul 19.

DOI:10.1109/TIP.2023.3283916

Abstract

Video-language pre-training has attracted considerable attention recently for its promising performance on various downstream tasks. Most existing methods use modality-specific or modality-joint representation architectures for cross-modality pre-training. Unlike previous methods, this paper presents a novel architecture named Memory-augmented Inter-Modality Bridge (MemBridge), which uses learnable intermediate modality representations as the bridge for the interaction between videos and language. Specifically, in the transformer-based cross-modality encoder, we introduce learnable bridge tokens as the means of interaction, so that video and language tokens can only perceive information from the bridge tokens and themselves. Moreover, a memory bank is proposed to store abundant modality interaction information, from which bridge tokens are generated adaptively for different cases, enhancing the capacity and robustness of the inter-modality bridge. Through pre-training, MemBridge explicitly models these representations for more thorough inter-modality interaction. Comprehensive experiments on multiple datasets show that our approach achieves performance competitive with previous methods on various downstream tasks, including video-text retrieval, video captioning, and video question answering, demonstrating the effectiveness of the proposed method. The code is available at https://github.com/jahhaoyang/MemBridge.
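The restriction described in the abstract, where video and language tokens perceive only the bridge tokens and themselves, can be pictured as an attention mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_bridge_mask`, the token ordering (video, bridge, text), the reading of "themselves" as "tokens of the same modality", and the choice to let bridge tokens attend to every token are all hypothetical, since the abstract does not specify them.

```python
def build_bridge_mask(n_video, n_bridge, n_text):
    """Return allowed[i][j] == True when token i may attend to token j.

    Assumed token order (hypothetical): video, then bridge, then text.
    """
    groups = ["video"] * n_video + ["bridge"] * n_bridge + ["text"] * n_text
    allowed = []
    for query_group in groups:
        row = []
        for key_group in groups:
            if query_group == "bridge":
                # Assumption: bridge tokens may read from every token.
                row.append(True)
            else:
                # Video/text tokens see only their own modality and the bridge.
                row.append(key_group in (query_group, "bridge"))
        allowed.append(row)
    return allowed


mask = build_bridge_mask(n_video=2, n_bridge=1, n_text=2)
for row in mask:
    print("".join("1" if ok else "0" for ok in row))
# prints:
# 11100
# 11100
# 11111
# 00111
# 00111
```

In this sketch no video row has a 1 in a text column and vice versa, so any information passing between the two modalities has to go through the bridge column. The memory bank that generates the bridge tokens is not modeled here.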

Similar articles

ITransformer: Intra- and Inter-Relation Embedding Transformer for TV Show Captioning.

IEEE Trans Image Process. 2022;31:3565-3577. doi: 10.1109/TIP.2022.3159472. Epub 2022 May 26.

Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering.

IEEE Trans Image Process. 2022;31:202-215. doi: 10.1109/TIP.2021.3120867. Epub 2021 Dec 3.

End-to-End Pre-Training With Hierarchical Matching and Momentum Contrast for Text-Video Retrieval.

IEEE Trans Image Process. 2023;32:5017-5030. doi: 10.1109/TIP.2023.3275071. Epub 2023 Sep 8.

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset.

IEEE Trans Pattern Anal Mach Intell. 2025 Feb;47(2):708-724. doi: 10.1109/TPAMI.2024.3479776. Epub 2025 Jan 9.

Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA.

IEEE Trans Image Process. 2021;30:5477-5489. doi: 10.1109/TIP.2021.3076556. Epub 2021 Jun 11.

Visual Commonsense-Aware Representation Network for Video Captioning.

IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):1092-1103. doi: 10.1109/TNNLS.2023.3323491. Epub 2025 Jan 7.

RetroCaptioner: beyond attention in end-to-end retrosynthesis transformer via contrastively captioned learnable graph representation.

Bioinformatics. 2024 Sep 2;40(9). doi: 10.1093/bioinformatics/btae561.

Concept-Aware Video Captioning: Describing Videos With Effective Prior Information.

IEEE Trans Image Process. 2023;32:5366-5378. doi: 10.1109/TIP.2023.3307969. Epub 2023 Oct 2.

BioBERT: a pre-trained biomedical language representation model for biomedical text mining.

Bioinformatics. 2020 Feb 15;36(4):1234-1240. doi: 10.1093/bioinformatics/btz682.
