Learning Hierarchical Modular Networks for Video Captioning.

Authors

Li Guorong, Ye Hanhua, Qi Yuankai, Wang Shuhui, Qing Laiyun, Huang Qingming, Yang Ming-Hsuan

Publication information

IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1049-1064. doi: 10.1109/TPAMI.2023.3327677. Epub 2024 Jan 9.

Abstract

Video captioning aims to generate natural language descriptions for a given video clip. Existing methods mainly focus on end-to-end representation learning via word-by-word comparison between predicted captions and ground-truth texts. Although significant progress has been made, such supervised approaches neglect semantic alignment between visual and linguistic entities, which may negatively affect the generated captions. In this work, we propose a hierarchical modular network to bridge video representations and linguistic semantics at four granularities before generating captions: entity, verb, predicate, and sentence. Each level is implemented by one module that embeds the corresponding semantics into video representations. Additionally, we present a reinforcement learning module based on the scene graph of captions to better measure sentence similarity. Extensive experimental results show that the proposed method performs favorably against state-of-the-art models on three widely used benchmark datasets: Microsoft Research Video Description Corpus (MSVD), MSR-Video to Text (MSR-VTT), and video-and-TEXt (VATEX).

Similar articles

2
Aligning Source Visual and Target Language Domains for Unpaired Video Captioning.
IEEE Trans Pattern Anal Mach Intell. 2022 Dec;44(12):9255-9268. doi: 10.1109/TPAMI.2021.3132229. Epub 2022 Nov 7.
3
Semantic guidance network for video captioning.
Sci Rep. 2023 Sep 26;13(1):16076. doi: 10.1038/s41598-023-43010-3.
5
Video Captioning Using Global-Local Representation.
IEEE Trans Circuits Syst Video Technol. 2022 Oct;32(10):6642-6656. doi: 10.1109/TCSVT.2022.3177320. Epub 2022 May 23.
6
Visual Commonsense-Aware Representation Network for Video Captioning.
IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):1092-1103. doi: 10.1109/TNNLS.2023.3323491. Epub 2025 Jan 7.
9
Cross-Modal Graph With Meta Concepts for Video Captioning.
IEEE Trans Image Process. 2022;31:5150-5162. doi: 10.1109/TIP.2022.3192709. Epub 2022 Aug 2.
