Ting Yu, Jun Yu, Zhou Yu, Dacheng Tao
IEEE Trans Image Process. 2019 Sep 16. doi: 10.1109/TIP.2019.2940677.
Given a video, Video Question Answering (VideoQA) aims at answering arbitrary free-form questions about the video content in natural language. A successful VideoQA framework usually has the following two key components: 1) a discriminative video encoder that learns an effective video representation maintaining as much information as possible about the video and 2) a question-guided decoder that learns to select the most related features to perform spatiotemporal reasoning and output the correct answer. We propose compositional attention networks (CAN) with two-stream fusion for VideoQA tasks. For the encoder, we sample video snippets using a two-stream mechanism (i.e., a uniform sampling stream and an action pooling stream) and extract a sequence of visual features for each stream to represent the video semantics. For the decoder, we propose a compositional attention module to integrate the two-stream features with the attention mechanism. The compositional attention module is the core of CAN and can be seen as a modular combination of a unified attention block. With different fusion strategies, we devise five compositional attention module variants. We evaluate our approach on one long-term VideoQA dataset, ActivityNet-QA, and two short-term VideoQA datasets, MSRVTT-QA and MSVD-QA. Our CAN model achieves new state-of-the-art results on all the datasets.
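To make the two-stream fusion idea concrete, the following is a minimal sketch (not the authors' implementation) of question-guided attention pooling over two feature streams, fused by element-wise sum; the function names, dimensions, and the choice of additive fusion are illustrative assumptions, and the paper's five module variants correspond to different fusion strategies over such attention blocks:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(features, query):
    """Question-guided attention over a sequence of visual features.
    features: (T, D) frame features, query: (D,) question embedding.
    Returns a (D,) attended summary of the sequence."""
    scores = features @ query      # (T,) relevance of each snippet to the question
    weights = softmax(scores)      # attention distribution over time
    return weights @ features      # weighted sum of snippet features

def two_stream_fusion(uniform_feats, action_feats, question):
    """Attend over each stream with the same question vector, then fuse
    the two summaries by element-wise sum (one possible fusion strategy;
    concatenation or gating would give other variants)."""
    u = attend(uniform_feats, question)   # uniform sampling stream
    a = attend(action_feats, question)    # action pooling stream
    return u + a

# Toy usage with random features standing in for CNN snippet features
rng = np.random.default_rng(0)
T, D = 8, 16                              # 8 snippets, 16-dim features (assumed)
uniform = rng.normal(size=(T, D))
action = rng.normal(size=(T, D))
q = rng.normal(size=D)
fused = two_stream_fusion(uniform, action, q)
print(fused.shape)  # (16,)
```

The fused vector would then feed an answer classifier; in the paper, the unified attention block is composed in different configurations to build the five module variants.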