Gao Lianli, Lei Yu, Zeng Pengpeng, Song Jingkuan, Wang Meng, Shen Heng Tao
IEEE Trans Image Process. 2022;31:202-215. doi: 10.1109/TIP.2021.3120867. Epub 2021 Dec 3.
Recently, integrating vision and language for in-depth video understanding, e.g., video captioning and video question answering, has become a promising direction for artificial intelligence. However, due to the complexity of video information, it is challenging to extract a video feature that represents multiple levels of concepts well, i.e., objects, actions and events. Meanwhile, content completeness and syntactic consistency play an important role in high-quality language-related video understanding. Motivated by these observations, we propose a novel framework, named Hierarchical Representation Network with Auxiliary Tasks (HRNAT), for learning multi-level representations and obtaining syntax-aware video captions. Specifically, the Cross-modality Matching Task enables the learning of a hierarchical representation of videos, guided by the three-level representation of language. The Syntax-guiding Task and the Vision-assist Task contribute to generating descriptions that are not only globally similar to the video content but also syntactically consistent with the ground-truth description. The key components of our model are general and can be readily applied to both video captioning and video question answering. Results on several benchmark datasets for both tasks validate the effectiveness of the proposed method and its superiority over state-of-the-art methods. Code and models are released at https://github.com/riesling00/HRNAT.
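To make the idea behind the Cross-modality Matching Task more concrete, the following is a minimal sketch of one way a three-level video-language matching objective could be written. It is an illustration under stated assumptions, not the formulation used in HRNAT: the abstract does not specify the loss, so a generic hinge-based ranking loss over cosine similarities is swapped in here. The level names (`object`, `action`, `event`), the function names, the margin value, and the toy features are all hypothetical.

```python
import math

# Hypothetical level names, taken from the concept levels named in the abstract.
LEVELS = ("object", "action", "event")


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hinge_matching_loss(video_feats, text_feats, margin=0.2):
    """Mean bidirectional hinge ranking loss over a batch.

    Assumption: video_feats[i] and text_feats[i] form a matched pair, and
    every other pairing in the batch is treated as a negative.
    """
    n = len(video_feats)
    total, count = 0.0, 0
    for i in range(n):
        pos = cosine(video_feats[i], text_feats[i])
        for j in range(n):
            if j == i:
                continue
            # Video i against a mismatched text j.
            total += max(0.0, margin - pos + cosine(video_feats[i], text_feats[j]))
            # Text i against a mismatched video j.
            total += max(0.0, margin - pos + cosine(video_feats[j], text_feats[i]))
            count += 2
    return total / count if count else 0.0


def hierarchical_matching_loss(video_levels, text_levels, margin=0.2):
    """Sum the matching loss over the three (assumed) levels."""
    return sum(
        hinge_matching_loss(video_levels[k], text_levels[k], margin) for k in LEVELS
    )


# Toy features: two video/text pairs per level, perfectly aligned.
aligned = {k: [[1.0, 0.0], [0.0, 1.0]] for k in LEVELS}
# The same features with the text side swapped, so every pair is mismatched.
swapped = {k: [[0.0, 1.0], [1.0, 0.0]] for k in LEVELS}

loss_aligned = hierarchical_matching_loss(aligned, aligned)
loss_swapped = hierarchical_matching_loss(aligned, swapped)
```

In this toy setup the aligned batch incurs no penalty, while the swapped batch is penalized at every level, which is the behavior a matching objective of this kind is meant to encourage. The released repository should be consulted for the actual training objectives.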