Yang Jiahao, Li Xiangyang, Zheng Mao, Wang Zihan, Zhu Yongqing, Guo Xiaoqian, Yuan Yuchen, Chai Zifeng, Jiang Shuqiang
IEEE Trans Image Process. 2023;32:4073-4087. doi: 10.1109/TIP.2023.3283916. Epub 2023 Jul 19.
Video-language pre-training has attracted considerable attention recently for its promising performance on various downstream tasks. Most existing methods use modality-specific or modality-joint representation architectures for cross-modality pre-training. In contrast, this paper presents a novel architecture named Memory-augmented Inter-Modality Bridge (MemBridge), which uses learnable intermediate modality representations as a bridge for the interaction between video and language. Specifically, in the transformer-based cross-modality encoder, we introduce learnable bridge tokens as the means of interaction, so that video and language tokens can only perceive information from the bridge tokens and themselves. Moreover, a memory bank is proposed to store abundant modality interaction information, from which bridge tokens are generated adaptively for different cases, enhancing the capacity and robustness of the inter-modality bridge. Through pre-training, MemBridge explicitly models these representations for more sufficient inter-modality interaction. Comprehensive experiments show that our approach achieves performance competitive with previous methods on various downstream tasks, including video-text retrieval, video captioning, and video question answering, across multiple datasets, demonstrating the effectiveness of the proposed method. The code is available at https://github.com/jahhaoyang/MemBridge.
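The two mechanisms described in the abstract can be illustrated with a minimal sketch. It is not taken from the MemBridge repository, and every name in it is hypothetical. It rests on three assumptions the abstract does not confirm: that "themselves" means tokens of the same modality, that bridge tokens may attend to all tokens, and that a bridge token is read from the memory bank by softmax-weighted dot-product attention. Tokens are assumed to be laid out as [video | text | bridge].

```python
import math


def build_bridge_mask(n_video, n_text, n_bridge):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumed layout: video tokens first, then text tokens, then bridge tokens.
    """
    kinds = ["video"] * n_video + ["text"] * n_text + ["bridge"] * n_bridge
    mask = []
    for query_kind in kinds:
        row = []
        for key_kind in kinds:
            if query_kind == "bridge":
                # Assumption: bridge tokens gather information from every token.
                allowed = True
            else:
                # Video/text tokens see only their own modality and the bridge.
                allowed = key_kind == query_kind or key_kind == "bridge"
            row.append(allowed)
        mask.append(row)
    return mask


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_bridge_token(query, memory):
    """Hypothetical memory readout: attention-weighted sum of memory slots.

    `query` is a feature vector summarizing the current input; `memory` is a
    list of slot vectors of the same dimension.
    """
    dim = len(query)
    scores = [
        sum(q * m for q, m in zip(query, slot)) / math.sqrt(dim)
        for slot in memory
    ]
    weights = softmax(scores)
    return [
        sum(w * slot[d] for w, slot in zip(weights, memory))
        for d in range(dim)
    ]


if __name__ == "__main__":
    mask = build_bridge_mask(n_video=2, n_text=2, n_bridge=1)
    for row in mask:
        print(["x" if allowed else "." for allowed in row])
    print(read_bridge_token([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch, video and text tokens never attend to each other directly, so any cross-modal information has to pass through the bridge positions. A mask of this shape could be converted to the additive or boolean form expected by a transformer attention layer.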