Aligning Source Visual and Target Language Domains for Unpaired Video Captioning.

Author Information

Liu Fenglin, Wu Xian, You Chenyu, Ge Shen, Zou Yuexian, Sun Xu

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9255-9268. doi: 10.1109/TPAMI.2021.3132229. Epub 2022 Nov 7.

Abstract

Training a supervised video captioning model requires coupled video-caption pairs. However, for many target languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task, which aims to train models without coupled video-caption pairs in the target language. A natural choice for this task is a two-step pipeline system: first use a video-to-pivot captioning model to generate captions in a pivot language, and then use a pivot-to-target translation model to translate the pivot captions into the target language. In such a pipeline system, however, 1) visual information cannot reach the translation model, which produces visually irrelevant target captions; and 2) errors in the generated pivot captions propagate to the translation model, which produces disfluent target captions. To address these problems, we propose the Unpaired Video Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module (VIM), which aligns the source visual and target language domains to inject source visual information into the target language domain. VIM also directly connects the encoder of the video-to-pivot model to the decoder of the pivot-to-target model, allowing end-to-end inference that skips the generation of pivot captions entirely. To strengthen the cross-modality injection of VIM, UVC-VI further introduces a pluggable video encoder, the Multimodal Collaborative Encoder (MCE). Experiments show that UVC-VI outperforms pipeline systems and exceeds several supervised systems. Furthermore, equipping existing supervised systems with our MCE yields relative gains of 4% and 7% in CIDEr score over current state-of-the-art models on the benchmark MSVD and MSR-VTT datasets, respectively.
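The abstract describes an architecture in which VIM maps encoded video features into the target-language decoder space, so the decoder of the pivot-to-target translation model is driven directly by visual features and no pivot caption is ever generated. Below is a minimal PyTorch sketch of that wiring only; the class names, the two-layer MLP used for alignment, and the tensor shapes are illustrative assumptions, not the paper's released implementation (in particular, the domain-alignment training objective is not shown).

```python
# Illustrative sketch of the UVC-VI inference path (not the authors' code).
import torch
import torch.nn as nn


class VisualInjectionModule(nn.Module):
    """Maps encoded video features into the target-language decoder space,
    so the translation decoder can attend to visual content directly.
    The two-layer MLP here is an assumption for illustration."""

    def __init__(self, visual_dim: int, text_dim: int):
        super().__init__()
        self.align = nn.Sequential(
            nn.Linear(visual_dim, text_dim),
            nn.ReLU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_frames, visual_dim)
        return self.align(visual_feats)  # (batch, num_frames, text_dim)


class UVCVISketch(nn.Module):
    """End-to-end wiring: video encoder -> VIM -> target-language decoder.
    No pivot caption is generated, so pivot errors cannot propagate."""

    def __init__(self, video_encoder: nn.Module, vim: VisualInjectionModule,
                 target_decoder: nn.Module):
        super().__init__()
        self.video_encoder = video_encoder    # taken from the video-to-pivot model
        self.vim = vim                        # bridges the two domains
        self.target_decoder = target_decoder  # taken from the pivot-to-target model

    def forward(self, video_frames: torch.Tensor,
                target_tokens: torch.Tensor) -> torch.Tensor:
        memory = self.vim(self.video_encoder(video_frames))
        return self.target_decoder(target_tokens, memory)


if __name__ == "__main__":
    batch, frames, visual_dim, text_dim = 2, 8, 2048, 512  # assumed sizes
    vim = VisualInjectionModule(visual_dim, text_dim)
    feats = vim(torch.randn(batch, frames, visual_dim))
    print(feats.shape)  # torch.Size([2, 8, 512])
```

The point of the sketch is the data flow: because the decoder consumes VIM-aligned visual features rather than a generated pivot sentence, inference skips the pivot stage, which is how the paper avoids both visual-information loss and pivot-error propagation.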
