Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering.

Author information

Gao Lianli, Lei Yu, Zeng Pengpeng, Song Jingkuan, Wang Meng, Shen Heng Tao

Publication information

IEEE Trans Image Process. 2022;31:202-215. doi: 10.1109/TIP.2021.3120867. Epub 2021 Dec 3.

Abstract

Recently, integrating vision and language for in-depth video understanding, e.g., video captioning and video question answering, has become a promising direction for artificial intelligence. However, due to the complexity of video information, it is challenging to extract a video feature that represents multiple levels of concepts well, i.e., objects, actions, and events. Meanwhile, content completeness and syntactic consistency play an important role in high-quality language-related video understanding. Motivated by these observations, we propose a novel framework, named Hierarchical Representation Network with Auxiliary Tasks (HRNAT), for learning multi-level representations and obtaining syntax-aware video captions. Specifically, the Cross-modality Matching Task enables the learning of a hierarchical representation of videos, guided by the three-level representation of language. The Syntax-guiding Task and the Vision-assist Task contribute to generating descriptions that are not only globally similar to the video content, but also syntactically consistent with the ground-truth description. The key components of our model are general and can be readily applied to both video captioning and video question answering. Results on several benchmark datasets for both tasks validate the effectiveness of the proposed method and its superiority over state-of-the-art methods. Code and models are released at https://github.com/riesling00/HRNAT.
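
The abstract does not state how the Cross-modality Matching Task is scored, so the following is only an illustrative sketch of one common way to align video and language features level by level: a max-margin ranking loss over cosine similarity, summed across three levels. The level names follow the abstract's objects, actions, and events; the function names, the `margin` value, the choice of hinge loss, and the assumption that each video level pairs with one language level are all hypothetical and are not taken from the paper or its released code.

```python
import math

# Assumed level names, mirroring the abstract's "objects, actions and events".
LEVELS = ("object", "action", "event")


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def matching_loss(video_feats, text_feats, margin=0.2):
    """Hinge ranking loss for one level (hypothetical choice of loss).

    Row i of video_feats is assumed to match row i of text_feats.
    Each matched pair should score at least `margin` higher than
    every mismatched pair that shares the same video.
    """
    n = len(video_feats)
    total = 0.0
    for i in range(n):
        pos = cosine(video_feats[i], text_feats[i])
        for j in range(n):
            if j != i:
                neg = cosine(video_feats[i], text_feats[j])
                total += max(0.0, margin - pos + neg)
    # Average over the number of mismatched pairs (guard against n == 1).
    return total / max(1, n * (n - 1))


def hierarchical_matching_loss(video_levels, text_levels, margin=0.2):
    """Sum the per-level losses; inputs are dicts keyed by level name."""
    return sum(
        matching_loss(video_levels[k], text_levels[k], margin) for k in LEVELS
    )


if __name__ == "__main__":
    # Toy batch of two videos with 2-dimensional features per level.
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    video = {k: aligned for k in LEVELS}
    text = {k: aligned for k in LEVELS}
    print(hierarchical_matching_loss(video, text))
```

In a setup like this, the matching term would typically be added to the main captioning or question-answering loss with a weighting factor, so that the auxiliary task shapes the video features without replacing the primary objective.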

