Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China; Muroran Institute of Technology, Muroran 050-8585, Japan.
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China.
Neural Netw. 2023 Nov;168:272-286. doi: 10.1016/j.neunet.2023.08.057. Epub 2023 Sep 16.
Video question answering (VideoQA) is a challenging video understanding task that requires a comprehensive understanding of multimodal information and accurate answers to related questions. Most existing VideoQA models use Graph Neural Networks (GNNs) to capture temporal-spatial interactions between objects. Despite achieving some success, we argue that current schemes have two limitations: (i) existing graph-based methods need to stack multiple GNN layers to capture high-order relations between objects, which inevitably introduces irrelevant noise; (ii) they neglect the unique self-supervised signals in the high-order relational structures among multiple objects, which can facilitate more accurate question answering. To this end, we propose a novel Multi-scale Self-supervised Hypergraph Contrastive Learning (MSHCL) framework for VideoQA. Specifically, we first segment the video along multiple temporal scales to obtain multiple frame groups. For each frame group, we design appearance and motion hyperedges based on node semantics to connect object nodes. In this way, we construct a multi-scale temporal-spatial hypergraph that directly captures high-order relations among multiple objects. Furthermore, the node features obtained after hypergraph convolution are fed into a Transformer to capture the global information of the input sequence. Second, we design a self-supervised hypergraph contrastive learning task based on node- and hyperedge-dropping data augmentation, together with an improved question-guided multimodal interaction module, to enhance the accuracy and robustness of the VideoQA model. Finally, extensive experiments on three benchmark datasets demonstrate the superiority of the proposed MSHCL over state-of-the-art methods.
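To make the core ideas in the abstract concrete, the following is a minimal sketch (not the authors' released code) of the two ingredients it describes: a hypergraph convolution over object nodes connected by hyperedges, and a node-/hyperedge-dropping augmentation paired with an InfoNCE-style contrastive loss. It assumes a PyTorch environment; the incidence matrices, hyperedge grouping, layer names, and pooling are illustrative placeholders and do not reproduce the paper's appearance/motion hyperedge construction or question guidance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HypergraphConv(nn.Module):
    """One hypergraph convolution: X' = Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, incidence):
        # x: (num_nodes, in_dim); incidence H: (num_nodes, num_edges), 0/1 entries.
        w = torch.ones(incidence.size(1), device=x.device)      # hyperedge weights
        dv = (incidence * w).sum(dim=1).clamp(min=1e-6)         # node degrees
        de = incidence.sum(dim=0).clamp(min=1e-6)               # hyperedge degrees
        dv_inv_sqrt = dv.pow(-0.5)
        msg = dv_inv_sqrt.unsqueeze(1) * x                      # Dv^-1/2 X
        msg = incidence.t() @ msg / de.unsqueeze(1)             # De^-1 H^T ...
        msg = incidence @ (w.unsqueeze(1) * msg)                # H W ...
        msg = dv_inv_sqrt.unsqueeze(1) * msg                    # Dv^-1/2 ...
        return F.relu(self.theta(msg))


def drop_augment(x, incidence, node_p=0.1, edge_p=0.2):
    """Node- and hyperedge-dropping augmentation: zero random node features
    and remove random hyperedge columns to create a perturbed view."""
    node_mask = (torch.rand(x.size(0), device=x.device) > node_p).float()
    edge_mask = torch.rand(incidence.size(1), device=x.device) > edge_p
    return x * node_mask.unsqueeze(1), incidence[:, edge_mask]


def info_nce(z1, z2, tau=0.5):
    """InfoNCE/NT-Xent-style loss between two batches of view embeddings."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                                  # (B, B) similarities
    labels = torch.arange(z1.size(0), device=z1.device)        # positives on diagonal
    return F.cross_entropy(logits, labels)


# Toy usage: a batch of 4 "videos", each with 8 object nodes and 3 hyperedges.
conv = HypergraphConv(64, 64)
z1, z2 = [], []
for _ in range(4):
    x = torch.randn(8, 64)                                     # object node features
    incidence = (torch.rand(8, 3) > 0.5).float()               # toy hyperedge membership
    for view in (z1, z2):
        xa, ha = drop_augment(x, incidence)                    # independently augmented view
        view.append(conv(xa, ha).mean(dim=0))                  # pool nodes to a video vector
loss = info_nce(torch.stack(z1), torch.stack(z2))
```

In the paper, such a contrastive objective would be trained jointly with the question-answering loss; here the mean-pooled video vector simply stands in for whatever readout the full model uses.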