Suppr 超能文献


A multi-scale self-supervised hypergraph contrastive learning framework for video question answering.

Affiliations

Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China; Muroran Institute of Technology, Muroran 050-8585, Japan.

Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China.

Publication Information

Neural Netw. 2023 Nov;168:272-286. doi: 10.1016/j.neunet.2023.08.057. Epub 2023 Sep 16.

DOI: 10.1016/j.neunet.2023.08.057
PMID: 37774513
Abstract

Video question answering (VideoQA) is a challenging video understanding task that requires a comprehensive understanding of multimodal information and accurate answers to related questions. Most existing VideoQA models use Graph Neural Networks (GNN) to capture temporal-spatial interactions between objects. Despite achieving certain success, we argue that current schemes have two limitations: (i) existing graph-based methods require stacking multi-layers of GNN to capture high-order relations between objects, which inevitably introduces irrelevant noise; (ii) neglecting the unique self-supervised signals in the high-order relational structures among multiple objects that can facilitate more accurate QA. To this end, we propose a novel Multi-scale Self-supervised Hypergraph Contrastive Learning (MSHCL) framework for VideoQA. Specifically, we first segment the video from multiple temporal dimensions to obtain multiple frame groups. For different frame groups, we design appearance and motion hyperedges based on node semantics to connect object nodes. In this way, we construct a multi-scale temporal-spatial hypergraph to directly capture high-order relations among multiple objects. Furthermore, the node features after hypergraph convolution are injected into a Transformer to capture the global information of the input sequence. Second, we design a self-supervised hypergraph contrastive learning task based on the node- and hyperedge-dropping data augmentation and an improved question-guided multimodal interaction module to enhance the accuracy and robustness of the VideoQA model. Finally, extensive experiments on three benchmark datasets demonstrate the superiority of our proposed MSHCL compared with state-of-the-art methods.

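The core self-supervised loop described in the abstract (augment a hypergraph by dropping nodes and hyperedges, run hypergraph convolution, and contrast the two resulting views) can be sketched in miniature. The following is an illustrative simplification, not the authors' implementation: the incidence-matrix representation `H`, the unweighted mean-aggregation convolution, and the InfoNCE-style loss are assumptions made for exposition, and the learned weights, Transformer, and question-guided interaction module are omitted.

```python
import numpy as np

def drop_augment(H, node_drop=0.2, edge_drop=0.2, rng=None):
    """Build an augmented view by randomly zeroing nodes (rows) and
    hyperedges (columns) of the incidence matrix H, keeping shapes fixed."""
    rng = rng if rng is not None else np.random.default_rng()
    n, m = H.shape
    node_mask = rng.random(n) >= node_drop      # True = keep node
    edge_mask = rng.random(m) >= edge_drop      # True = keep hyperedge
    return H * node_mask[:, None] * edge_mask[None, :]

def hypergraph_conv(H, X):
    """One simplified hypergraph convolution: mean-aggregate node features
    into hyperedges, then scatter back to nodes (no learned weights)."""
    De = H.sum(axis=0, keepdims=True) + 1e-9    # hyperedge degrees, (1, m)
    Dv = H.sum(axis=1, keepdims=True) + 1e-9    # node degrees, (n, 1)
    edge_feat = (H.T @ X) / De.T                # nodes -> hyperedges
    return (H @ edge_feat) / Dv                 # hyperedges -> nodes

def contrastive_loss(Z1, Z2, tau=0.5):
    """InfoNCE-style loss between two views; matching rows are positives."""
    Z1 = Z1 / (np.linalg.norm(Z1, axis=1, keepdims=True) + 1e-9)
    Z2 = Z2 / (np.linalg.norm(Z2, axis=1, keepdims=True) + 1e-9)
    sim = Z1 @ Z2.T / tau                       # pairwise cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # positives on the diagonal

rng = np.random.default_rng(0)
H = (rng.random((6, 4)) > 0.5).astype(float)    # toy graph: 6 nodes, 4 hyperedges
X = rng.standard_normal((6, 8))                 # toy node features
Z1 = hypergraph_conv(drop_augment(H, rng=rng), X)
Z2 = hypergraph_conv(drop_augment(H, rng=rng), X)
loss = contrastive_loss(Z1, Z2)
print(float(loss))
```

In the paper's setting, the two views would additionally pass through learned convolution weights and a Transformer before the loss, and separate appearance and motion hypergraphs would be built per frame group; the sketch above only shows why dropping nodes/hyperedges yields valid paired views for contrastive training.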

Similar Articles

1. A multi-scale self-supervised hypergraph contrastive learning framework for video question answering.
Neural Netw. 2023 Nov;168:272-286. doi: 10.1016/j.neunet.2023.08.057. Epub 2023 Sep 16.
2. Dynamic Spatio-Temporal Graph Reasoning for VideoQA With Self-Supervised Event Recognition.
IEEE Trans Image Process. 2024;33:4145-4158. doi: 10.1109/TIP.2024.3411448. Epub 2024 Jul 9.
3. Contrastive Video Question Answering via Video Graph Transformer.
IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):13265-13280. doi: 10.1109/TPAMI.2023.3292266. Epub 2023 Oct 3.
4. TCGL: Temporal Contrastive Graph for Self-Supervised Video Representation Learning.
IEEE Trans Image Process. 2022;31:1978-1993. doi: 10.1109/TIP.2022.3147032. Epub 2022 Feb 18.
5. Masked hypergraph learning for weakly supervised histopathology whole slide image classification.
Comput Methods Programs Biomed. 2024 Aug;253:108237. doi: 10.1016/j.cmpb.2024.108237. Epub 2024 May 23.
6. Dual-Channel Adaptive Scale Hypergraph Encoders With Cross-View Contrastive Learning for Knowledge Tracing.
IEEE Trans Neural Netw Learn Syst. 2025 Apr;36(4):6752-6766. doi: 10.1109/TNNLS.2024.3386810. Epub 2025 Apr 4.
7. Transformer-Empowered Invariant Grounding for Video Question Answering.
IEEE Trans Pattern Anal Mach Intell. 2023 Aug 9;PP. doi: 10.1109/TPAMI.2023.3303451.
8. GTC: GNN-Transformer co-contrastive learning for self-supervised heterogeneous graph representation.
Neural Netw. 2025 Jan;181:106645. doi: 10.1016/j.neunet.2024.106645. Epub 2024 Aug 16.
9. Prediction of multi-relational drug-gene interaction via Dynamic hyperGraph Contrastive Learning.
Brief Bioinform. 2023 Sep 22;24(6). doi: 10.1093/bib/bbad371.
10. Local structure-aware graph contrastive representation learning.
Neural Netw. 2024 Apr;172:106083. doi: 10.1016/j.neunet.2023.12.037. Epub 2023 Dec 27.