
Compositional Attention Networks with Two-Stream Fusion for Video Question Answering.

Authors

Yu Ting, Yu Jun, Yu Zhou, Tao Dacheng

Publication

IEEE Trans Image Process. 2019 Sep 16. doi: 10.1109/TIP.2019.2940677.

PMID: 31535995
Abstract

Given a video, Video Question Answering (VideoQA) aims at answering arbitrary free-form questions about the video content in natural language. A successful VideoQA framework usually has the following two key components: 1) a discriminative video encoder that learns an effective video representation, maintaining as much information as possible about the video, and 2) a question-guided decoder that learns to select the most relevant features for spatiotemporal reasoning and to output the correct answer. We propose compositional attention networks (CAN) with two-stream fusion for VideoQA tasks. For the encoder, we sample video snippets using a two-stream mechanism (i.e., a uniform sampling stream and an action pooling stream) and extract a sequence of visual features for each stream to represent the video semantics. For the decoder, we propose a compositional attention module to integrate the two-stream features with the attention mechanism. The compositional attention module is the core of CAN and can be seen as a modular combination of a unified attention block. With different fusion strategies, we devise five compositional attention module variants. We evaluate our approach on one long-term VideoQA dataset, ActivityNet-QA, and two short-term VideoQA datasets, MSRVTT-QA and MSVD-QA. Our CAN model achieves new state-of-the-art results on all the datasets.
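The question-guided attention-and-fusion pipeline described in the abstract can be sketched roughly as follows. This is a minimal pure-Python illustration, not the authors' implementation: `unified_attention` stands in for the paper's unified attention block using a plain dot-product score, and `two_stream_fusion` shows just one plausible fusion strategy (parallel attention over each stream followed by element-wise sum) out of the five variants the paper devises; all function names and toy dimensions are assumptions.

```python
import math
import random

def unified_attention(features, query):
    """Unified attention block (sketch): score each time step of a
    feature sequence (list of T vectors) against the question vector
    with a dot product, softmax the scores over time, and return the
    attention-weighted sum of the features."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]       # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, features))
            for d in range(dim)]

def two_stream_fusion(uniform_feats, action_feats, query):
    """One plausible fusion variant (parallel attention + element-wise
    sum): attend over the uniform-sampling stream and the action-pooling
    stream independently, then add the two attended vectors."""
    u = unified_attention(uniform_feats, query)
    a = unified_attention(action_feats, query)
    return [x + y for x, y in zip(u, a)]

# Toy example: 4 video snippets per stream, 8-dim features.
random.seed(0)
uniform_feats = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
action_feats = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
query = [random.gauss(0, 1) for _ in range(8)]
fused = two_stream_fusion(uniform_feats, action_feats, query)
print(len(fused))  # 8
```

Other fusion variants in this spirit would differ only in the last step, e.g., concatenating the two attended vectors, or attending over one stream conditioned on the other.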


Similar Articles

1. Compositional Attention Networks with Two-Stream Fusion for Video Question Answering.
   IEEE Trans Image Process. 2019 Sep 16. doi: 10.1109/TIP.2019.2940677.
2. Learning to Answer Visual Questions From Web Videos.
   IEEE Trans Pattern Anal Mach Intell. 2025 May;47(5):3202-3218. doi: 10.1109/TPAMI.2022.3173208. Epub 2025 Apr 8.
3. Video Question Answering With Prior Knowledge and Object-Sensitive Learning.
   IEEE Trans Image Process. 2022;31:5936-5948. doi: 10.1109/TIP.2022.3205212. Epub 2022 Sep 15.
4. Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering.
   IEEE Trans Image Process. 2022;31:1684-1696. doi: 10.1109/TIP.2022.3142526. Epub 2022 Feb 3.
5. Bilinear pooling in video-QA: empirical challenges and motivational drift from neurological parallels.
   PeerJ Comput Sci. 2022 Jun 3;8:e974. doi: 10.7717/peerj-cs.974. eCollection 2022.
6. Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks.
   IEEE Trans Image Process. 2019 Aug;28(8):3860-3872. doi: 10.1109/TIP.2019.2902106. Epub 2019 Feb 27.
7. Transformer-Empowered Invariant Grounding for Video Question Answering.
   IEEE Trans Pattern Anal Mach Intell. 2023 Aug 9;PP. doi: 10.1109/TPAMI.2023.3303451.
8. Question-Guided Erasing-Based Spatiotemporal Attention Learning for Video Question Answering.
   IEEE Trans Neural Netw Learn Syst. 2023 Mar;34(3):1367-1379. doi: 10.1109/TNNLS.2021.3105280. Epub 2023 Feb 28.
9. A multi-scale self-supervised hypergraph contrastive learning framework for video question answering.
   Neural Netw. 2023 Nov;168:272-286. doi: 10.1016/j.neunet.2023.08.057. Epub 2023 Sep 16.
10. Event Graph Guided Compositional Spatial-Temporal Reasoning for Video Question Answering.
    IEEE Trans Image Process. 2024;33:1109-1121. doi: 10.1109/TIP.2024.3358726. Epub 2024 Feb 5.