Li Guorong, Ye Hanhua, Qi Yuankai, Wang Shuhui, Qing Laiyun, Huang Qingming, Yang Ming-Hsuan
IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1049-1064. doi: 10.1109/TPAMI.2023.3327677. Epub 2024 Jan 9.
Video captioning aims to generate natural language descriptions for a given video clip. Existing methods mainly focus on end-to-end representation learning via word-by-word comparison between predicted captions and ground-truth texts. Although significant progress has been made, such supervised approaches neglect semantic alignment between visual and linguistic entities, which may negatively affect the generated captions. In this work, we propose a hierarchical modular network to bridge video representations and linguistic semantics at four granularities before generating captions: entity, verb, predicate, and sentence. Each level is implemented by one module to embed corresponding semantics into video representations. Additionally, we present a reinforcement learning module based on the scene graph of captions to better measure sentence similarity. Extensive experimental results show that the proposed method performs favorably against state-of-the-art models on three widely used benchmark datasets: the Microsoft Research Video Description Corpus (MSVD), MSR-Video to Text (MSR-VTT), and Video-and-TEXt (VATEX).
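The abstract describes the architecture only at a high level. Below is a minimal, hypothetical PyTorch sketch of the core idea: one module per linguistic granularity (entity, verb, predicate, sentence) that embeds the corresponding semantics into video representations before caption generation. The module structure, feature dimensions, mean-pooling, and concatenation step are all assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): four modules, one per granularity,
# each projecting pooled video features toward a different level of linguistic
# semantics; their outputs are fused and handed to a caption decoder (not shown).
import torch
import torch.nn as nn


class SemanticModule(nn.Module):
    """Projects pooled video features into one linguistic embedding space."""

    def __init__(self, video_dim: int, semantic_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, semantic_dim),
            nn.ReLU(),
            nn.Linear(semantic_dim, semantic_dim),
        )

    def forward(self, video_feat: torch.Tensor) -> torch.Tensor:
        return self.proj(video_feat)


class HierarchicalModularEncoder(nn.Module):
    """One module per granularity; embeddings are concatenated for decoding."""

    LEVELS = ("entity", "verb", "predicate", "sentence")

    def __init__(self, video_dim: int = 1024, semantic_dim: int = 512):
        super().__init__()
        self.level_modules = nn.ModuleDict(
            {level: SemanticModule(video_dim, semantic_dim) for level in self.LEVELS}
        )

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, num_frames, video_dim); mean-pool over frames.
        pooled = video_feats.mean(dim=1)
        embeddings = [self.level_modules[level](pooled) for level in self.LEVELS]
        return torch.cat(embeddings, dim=-1)


if __name__ == "__main__":
    encoder = HierarchicalModularEncoder()
    dummy_video = torch.randn(2, 16, 1024)  # 2 clips, 16 frames, 1024-d features
    print(encoder(dummy_video).shape)       # torch.Size([2, 2048])
```

In the method the abstract describes, each granularity level would additionally be aligned with linguistic semantics drawn from the captions, and the decoder would be optimized with the scene-graph-based reinforcement learning reward; neither of those components is shown in this sketch.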