Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China; Muroran Institute of Technology, Muroran 050-8585, Japan.
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia, Beijing University of Posts and Telecommunications, Beijing 100876, China.
Neural Netw. 2023 Nov;168:272-286. doi: 10.1016/j.neunet.2023.08.057. Epub 2023 Sep 16.
Video question answering (VideoQA) is a challenging video understanding task that requires a comprehensive understanding of multimodal information and accurate answers to related questions. Most existing VideoQA models use Graph Neural Networks (GNNs) to capture temporal-spatial interactions between objects. Despite achieving some success, we argue that current schemes have two limitations: (i) existing graph-based methods need to stack multiple GNN layers to capture high-order relations between objects, which inevitably introduces irrelevant noise; (ii) they neglect the unique self-supervised signals in the high-order relational structures among multiple objects, which can facilitate more accurate question answering. To this end, we propose a novel Multi-scale Self-supervised Hypergraph Contrastive Learning (MSHCL) framework for VideoQA. Specifically, we first segment the video along multiple temporal scales to obtain multiple frame groups. For each frame group, we design appearance and motion hyperedges based on node semantics to connect object nodes. In this way, we construct a multi-scale temporal-spatial hypergraph that directly captures high-order relations among multiple objects. Furthermore, the node features obtained after hypergraph convolution are fed into a Transformer to capture the global information of the input sequence. Second, we design a self-supervised hypergraph contrastive learning task based on node- and hyperedge-dropping data augmentation, together with an improved question-guided multimodal interaction module, to enhance the accuracy and robustness of the VideoQA model. Finally, extensive experiments on three benchmark datasets demonstrate the superiority of the proposed MSHCL over state-of-the-art methods.
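To make the core ideas in the abstract concrete, the following is a minimal sketch (not the authors' released code) of the two ingredients it describes: a hypergraph convolution over object nodes connected by hyperedges, and a node-/hyperedge-dropping augmentation paired with an InfoNCE-style contrastive loss. It assumes a PyTorch environment; the incidence matrices, hyperedge grouping, layer names, and pooling are illustrative placeholders and do not reproduce the paper's appearance/motion hyperedge construction or question guidance.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HypergraphConv(nn.Module):
    """One hypergraph convolution: X' = Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, incidence):
        # x: (num_nodes, in_dim); incidence H: (num_nodes, num_edges), 0/1 entries.
        w = torch.ones(incidence.size(1), device=x.device)      # hyperedge weights
        dv = (incidence * w).sum(dim=1).clamp(min=1e-6)         # node degrees
        de = incidence.sum(dim=0).clamp(min=1e-6)               # hyperedge degrees
        dv_inv_sqrt = dv.pow(-0.5)
        msg = dv_inv_sqrt.unsqueeze(1) * x                      # Dv^-1/2 X
        msg = incidence.t() @ msg / de.unsqueeze(1)             # De^-1 H^T ...
        msg = incidence @ (w.unsqueeze(1) * msg)                # H W ...
        msg = dv_inv_sqrt.unsqueeze(1) * msg                    # Dv^-1/2 ...
        return F.relu(self.theta(msg))


def drop_augment(x, incidence, node_p=0.1, edge_p=0.2):
    """Node- and hyperedge-dropping augmentation: zero random node features
    and remove random hyperedge columns to create a perturbed view."""
    node_mask = (torch.rand(x.size(0), device=x.device) > node_p).float()
    edge_mask = torch.rand(incidence.size(1), device=x.device) > edge_p
    return x * node_mask.unsqueeze(1), incidence[:, edge_mask]


def info_nce(z1, z2, tau=0.5):
    """InfoNCE/NT-Xent-style loss between two batches of view embeddings."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                                  # (B, B) similarities
    labels = torch.arange(z1.size(0), device=z1.device)        # positives on diagonal
    return F.cross_entropy(logits, labels)


# Toy usage: a batch of 4 "videos", each with 8 object nodes and 3 hyperedges.
conv = HypergraphConv(64, 64)
z1, z2 = [], []
for _ in range(4):
    x = torch.randn(8, 64)                                     # object node features
    incidence = (torch.rand(8, 3) > 0.5).float()               # toy hyperedge membership
    for view in (z1, z2):
        xa, ha = drop_augment(x, incidence)                    # independently augmented view
        view.append(conv(xa, ha).mean(dim=0))                  # pool nodes to a video vector
loss = info_nce(torch.stack(z1), torch.stack(z2))
```

In the paper, such a contrastive objective would be trained jointly with the question-answering loss; here the mean-pooled video vector simply stands in for whatever readout the full model uses.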