
Compositional Attention Networks with Two-Stream Fusion for Video Question Answering.

Authors

Yu Ting, Yu Jun, Yu Zhou, Tao Dacheng

Publication

IEEE Trans Image Process. 2019 Sep 16. doi: 10.1109/TIP.2019.2940677.

PMID: 31535995
Abstract

Given a video, Video Question Answering (VideoQA) aims at answering arbitrary free-form questions about the video content in natural language. A successful VideoQA framework usually has the following two key components: 1) a discriminative video encoder that learns an effective video representation, maintaining as much information as possible about the video, and 2) a question-guided decoder that learns to select the most relevant features for spatiotemporal reasoning and to output the correct answer. We propose compositional attention networks (CAN) with two-stream fusion for VideoQA tasks. For the encoder, we sample video snippets using a two-stream mechanism (i.e., a uniform sampling stream and an action pooling stream) and extract a sequence of visual features for each stream to represent the video semantics. For the decoder, we propose a compositional attention module to integrate the two-stream features with the attention mechanism. The compositional attention module is the core of CAN and can be seen as a modular combination of a unified attention block. With different fusion strategies, we devise five compositional attention module variants. We evaluate our approach on one long-term VideoQA dataset, ActivityNet-QA, and two short-term VideoQA datasets, MSRVTT-QA and MSVD-QA. Our CAN model achieves new state-of-the-art results on all the datasets.
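The question-guided attention-and-fusion pipeline described in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: `unified_attention` stands in for the paper's unified attention block using a plain dot-product score, and `two_stream_fusion` shows just one plausible fusion strategy (parallel attention over each stream followed by element-wise sum) out of the five variants the paper devises; all function names and toy dimensions are assumptions.

```python
import math
import random

def unified_attention(features, query):
    """Unified attention block (sketch): score each time step of a
    feature sequence (list of T vectors) against the question vector
    with a dot product, softmax the scores over time, and return the
    attention-weighted sum of the features."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]       # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, features))
            for d in range(dim)]

def two_stream_fusion(uniform_feats, action_feats, query):
    """One plausible fusion variant (parallel attention + element-wise
    sum): attend over the uniform-sampling stream and the action-pooling
    stream independently, then add the two attended vectors."""
    u = unified_attention(uniform_feats, query)
    a = unified_attention(action_feats, query)
    return [x + y for x, y in zip(u, a)]

# Toy example: 4 video snippets per stream, 8-dim features.
random.seed(0)
uniform_feats = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
action_feats = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
query = [random.gauss(0, 1) for _ in range(8)]
fused = two_stream_fusion(uniform_feats, action_feats, query)
print(len(fused))  # 8
```

Other fusion variants in this spirit would differ only in the last step, e.g., concatenating the two attended vectors, or attending over one stream conditioned on the other.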


Similar Articles

1. Compositional Attention Networks with Two-Stream Fusion for Video Question Answering.
   IEEE Trans Image Process. 2019 Sep 16. doi: 10.1109/TIP.2019.2940677.
2. Learning to Answer Visual Questions From Web Videos.
   IEEE Trans Pattern Anal Mach Intell. 2025 May;47(5):3202-3218. doi: 10.1109/TPAMI.2022.3173208. Epub 2025 Apr 8.
3. Video Question Answering With Prior Knowledge and Object-Sensitive Learning.
   IEEE Trans Image Process. 2022;31:5936-5948. doi: 10.1109/TIP.2022.3205212. Epub 2022 Sep 15.
4. Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering.
   IEEE Trans Image Process. 2022;31:1684-1696. doi: 10.1109/TIP.2022.3142526. Epub 2022 Feb 3.
5. Bilinear pooling in video-QA: empirical challenges and motivational drift from neurological parallels.
   PeerJ Comput Sci. 2022 Jun 3;8:e974. doi: 10.7717/peerj-cs.974. eCollection 2022.
6. Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks.
   IEEE Trans Image Process. 2019 Aug;28(8):3860-3872. doi: 10.1109/TIP.2019.2902106. Epub 2019 Feb 27.
7. Transformer-Empowered Invariant Grounding for Video Question Answering.
   IEEE Trans Pattern Anal Mach Intell. 2023 Aug 9;PP. doi: 10.1109/TPAMI.2023.3303451.
8. Question-Guided Erasing-Based Spatiotemporal Attention Learning for Video Question Answering.
   IEEE Trans Neural Netw Learn Syst. 2023 Mar;34(3):1367-1379. doi: 10.1109/TNNLS.2021.3105280. Epub 2023 Feb 28.
9. A multi-scale self-supervised hypergraph contrastive learning framework for video question answering.
   Neural Netw. 2023 Nov;168:272-286. doi: 10.1016/j.neunet.2023.08.057. Epub 2023 Sep 16.
10. Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering.
    IEEE Trans Image Process. 2024;33:1109-1121. doi: 10.1109/TIP.2024.3358726. Epub 2024 Feb 5.