Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering.

Authors

Gao Lianli, Lei Yu, Zeng Pengpeng, Song Jingkuan, Wang Meng, Shen Heng Tao

Publication information

IEEE Trans Image Process. 2022;31:202-215. doi: 10.1109/TIP.2021.3120867. Epub 2021 Dec 3.

DOI:10.1109/TIP.2021.3120867

Abstract

Recently, integrating vision and language for in-depth video understanding, e.g., video captioning and video question answering, has become a promising direction for artificial intelligence. However, due to the complexity of video information, it is challenging to extract a video feature that represents multiple levels of concepts well, i.e., objects, actions, and events. Meanwhile, content completeness and syntactic consistency play an important role in high-quality language-related video understanding. Motivated by these observations, we propose a novel framework, named Hierarchical Representation Network with Auxiliary Tasks (HRNAT), for learning multi-level representations and obtaining syntax-aware video captions. Specifically, the Cross-modality Matching Task enables the learning of a hierarchical representation of videos, guided by the three-level representation of language. The Syntax-guiding Task and the Vision-assist Task contribute to generating descriptions that are not only globally similar to the video content but also syntactically consistent with the ground-truth description. The key components of our model are general and can be readily applied to both video captioning and video question answering tasks. Results on several benchmark datasets for both tasks validate the effectiveness and superiority of the proposed method compared with state-of-the-art methods. Code and models are released at https://github.com/riesling00/HRNAT.

Similar articles

Hierarchical Representation Network With Auxiliary Tasks for Video Captioning and Video Question Answering.

IEEE Trans Image Process. 2022;31:202-215. doi: 10.1109/TIP.2021.3120867. Epub 2021 Dec 3.

Visual Commonsense-Aware Representation Network for Video Captioning.

IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):1092-1103. doi: 10.1109/TNNLS.2023.3323491. Epub 2025 Jan 7.

Syntax Customized Video Captioning by Imitating Exemplar Sentences.

IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):10209-10221. doi: 10.1109/TPAMI.2021.3131618. Epub 2022 Nov 7.

Learning Hierarchical Modular Networks for Video Captioning.

IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1049-1064. doi: 10.1109/TPAMI.2023.3327677. Epub 2024 Jan 9.

MemBridge: Video-Language Pre-Training With Memory-Augmented Inter-Modality Bridge.

IEEE Trans Image Process. 2023;32:4073-4087. doi: 10.1109/TIP.2023.3283916. Epub 2023 Jul 19.

Video Captioning Using Global-Local Representation.

IEEE Trans Circuits Syst Video Technol. 2022 Oct;32(10):6642-6656. doi: 10.1109/tcsvt.2022.3177320. Epub 2022 May 23.

Concept-Aware Video Captioning: Describing Videos With Effective Prior Information.

IEEE Trans Image Process. 2023;32:5366-5378. doi: 10.1109/TIP.2023.3307969. Epub 2023 Oct 2.

Multi-Turn Video Question Answering via Hierarchical Attention Context Reinforced Networks.

IEEE Trans Image Process. 2019 Aug;28(8):3860-3872. doi: 10.1109/TIP.2019.2902106. Epub 2019 Feb 27.

Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation.

IEEE Trans Image Process. 2020 Apr 27. doi: 10.1109/TIP.2020.2988435.

Semantic guidance network for video captioning.

Sci Rep. 2023 Sep 26;13(1):16076. doi: 10.1038/s41598-023-43010-3.

Cited by

Semantic guidance network for video captioning.

Sci Rep. 2023 Sep 26;13(1):16076. doi: 10.1038/s41598-023-43010-3.
