MemBridge: Video-Language Pre-Training With Memory-Augmented Inter-Modality Bridge.

Author information

Yang Jiahao, Li Xiangyang, Zheng Mao, Wang Zihan, Zhu Yongqing, Guo Xiaoqian, Yuan Yuchen, Chai Zifeng, Jiang Shuqiang

Publication information

IEEE Trans Image Process. 2023;32:4073-4087. doi: 10.1109/TIP.2023.3283916. Epub 2023 Jul 19.

Abstract

Video-language pre-training has recently attracted considerable attention for its promising performance on various downstream tasks. Most existing methods use modality-specific or modality-joint representation architectures for cross-modality pre-training. In contrast, this paper presents a novel architecture named Memory-augmented Inter-Modality Bridge (MemBridge), which uses learnable intermediate modality representations as a bridge for the interaction between videos and language. Specifically, in the transformer-based cross-modality encoder, we introduce learnable bridge tokens as the means of interaction, so that the video and language tokens can only perceive information from bridge tokens and themselves. Moreover, a memory bank is proposed to store abundant modality interaction information for adaptively generating bridge tokens according to different cases, enhancing the capacity and robustness of the inter-modality bridge. Through pre-training, MemBridge explicitly models the representations for more sufficient inter-modality interaction. Comprehensive experiments on multiple datasets show that our approach achieves performance competitive with previous methods on various downstream tasks, including video-text retrieval, video captioning, and video question answering, demonstrating the effectiveness of the proposed method. The code is available at https://github.com/jahhaoyang/MemBridge.
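
The attention restriction described in the abstract (video and language tokens perceive only bridge tokens and themselves) can be illustrated with a boolean attention mask. The following is a minimal sketch, not the authors' implementation. It rests on several assumptions that the abstract does not state: the token order is [video | bridge | text], "themselves" means tokens of the same modality, and bridge tokens may attend to every token. The function name `build_bridge_mask` is hypothetical, and the memory bank that generates the bridge tokens is not modeled here.

```python
from typing import List


def build_bridge_mask(n_video: int, n_bridge: int, n_text: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed token order (not taken from the paper): [video | bridge | text].
    """
    total = n_video + n_bridge + n_text

    def group(idx: int) -> str:
        # Map a flat token index to the block it belongs to.
        if idx < n_video:
            return "video"
        if idx < n_video + n_bridge:
            return "bridge"
        return "text"

    mask: List[List[bool]] = []
    for i in range(total):
        query_group = group(i)
        row: List[bool] = []
        for j in range(total):
            key_group = group(j)
            if query_group == "bridge":
                # Assumption: bridge tokens read from all tokens.
                allowed = True
            else:
                # Video/text tokens see their own modality and the bridge only,
                # so the two modalities never attend to each other directly.
                allowed = key_group == query_group or key_group == "bridge"
            row.append(allowed)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for 2 video, 1 bridge, and 2 text tokens as 0/1 rows.
    for mask_row in build_bridge_mask(2, 1, 2):
        print("".join("1" if allowed else "0" for allowed in mask_row))
```

In a transformer encoder, a mask like this would typically be passed to the self-attention layers so that disallowed query-key pairs receive zero attention weight. Any information exchanged between video and text then has to pass through the bridge positions.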
