IEEE Trans Image Process. 2022;31:5936-5948. doi: 10.1109/TIP.2022.3205212. Epub 2022 Sep 15.
Video Question Answering (VideoQA), which reasons over the spatial-temporal visual information of a video given a linguistic query, has received unprecedented attention in recent years. One of the main challenges lies in locating the relevant visual and linguistic information, and various attention-based approaches have therefore been proposed. Despite impressive progress, current methods leave two aspects underexplored when computing attention. First, prior knowledge, which plays an important role in human cognition and can assist the reasoning process in VideoQA, is not fully utilized. Second, structured visual information (e.g., objects), as opposed to the raw video, is undervalued. To address these two issues, we propose Prior Knowledge and Object-sensitive Learning (PKOL), which explores the effect of prior knowledge and learns object-sensitive representations to boost the VideoQA task. Specifically, we first propose a Prior Knowledge Exploring (PKE) module that acquires prior knowledge and integrates it into the question feature to enrich it; an information retriever is constructed to retrieve related sentences from a massive corpus as prior knowledge. In addition, we propose an Object-sensitive Representation Learning (ORL) module that generates object-sensitive features by letting object-level features interact with frame-level and clip-level features. PKOL achieves consistent improvements on three competitive benchmarks (i.e., MSVD-QA, MSRVTT-QA, and TGIF-QA) and attains state-of-the-art performance. The source code is available at https://github.com/zchoi/PKOL.
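As a rough illustration of the two ideas in the abstract, the following sketch shows (a) retrieving related sentences for a question and folding them into the question feature, and (b) letting frame-level features attend over object-level features. It is not the authors' implementation: the abstract does not specify the retriever, the fusion operator, or the form of the object interaction, so cosine-similarity retrieval, dot-product attention, and residual addition are all assumptions made here, and every function name (`retrieve_top_k`, `enrich_question`, `object_sensitive`) is hypothetical. Random vectors stand in for real sentence, frame, and object embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def retrieve_top_k(question_vec, corpus_vecs, k=3):
    """Assumed retriever: indices of the k corpus sentences with the
    highest cosine similarity to the question embedding."""
    sims = l2_normalize(corpus_vecs) @ l2_normalize(question_vec)
    return np.argsort(-sims)[:k]


def enrich_question(question_vec, prior_vecs):
    """Assumed fusion: attend over the retrieved sentence embeddings and
    add the attended summary to the question feature (residual)."""
    d = question_vec.shape[-1]
    weights = softmax(prior_vecs @ question_vec / np.sqrt(d))
    return question_vec + weights @ prior_vecs


def object_sensitive(visual_feats, object_feats):
    """Assumed interaction: each frame-level (or clip-level) feature
    attends over the object-level features, with a residual connection."""
    d = visual_feats.shape[-1]
    attn = softmax(visual_feats @ object_feats.T / np.sqrt(d), axis=-1)
    return visual_feats + attn @ object_feats


# Toy usage with random stand-in embeddings.
rng = np.random.default_rng(0)
d = 16
corpus = rng.normal(size=(50, d))                 # 50 "sentences"
question = corpus[7] + 0.01 * rng.normal(size=d)  # near sentence 7
top = retrieve_top_k(question, corpus, k=3)
enriched = enrich_question(question, corpus[top])

frames = rng.normal(size=(8, d))    # 8 frame-level features
objects = rng.normal(size=(5, d))   # 5 object-level features
fused = object_sensitive(frames, objects)
```

In this toy setup the question is built as a slightly perturbed copy of sentence 7, so that sentence is retrieved first; the enriched question and the fused visual features keep the shapes of their inputs. The released code at the URL above is the reference for how PKE and ORL are actually built.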