Cross-Modal Graph With Meta Concepts for Video Captioning

Authors

Wang Hao, Lin Guosheng, Hoi Steven C H, Miao Chunyan

Publication

IEEE Trans Image Process. 2022;31:5150-5162. doi: 10.1109/TIP.2022.3192709. Epub 2022 Aug 2.

Abstract

Video captioning aims to interpret complex visual content as text descriptions, which requires the model to fully understand video scenes, including objects and their interactions. Prevailing methods adopt off-the-shelf object detection networks to produce object proposals and use the attention mechanism to model the relations between objects. They often miss semantic concepts left undefined by the pretrained model and fail to identify the exact predicate relationships between objects. In this paper, we investigate the open research task of generating text descriptions for given videos, and propose the Cross-Modal Graph (CMG) with meta concepts for video captioning. Specifically, to cover the useful semantic concepts in video captions, we weakly learn the visual regions corresponding to text descriptions, where the associated visual regions and textual words are named cross-modal meta concepts. We further build meta concept graphs dynamically from the learned cross-modal meta concepts. We also construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures. We validate the efficacy of the proposed techniques with extensive experiments and achieve state-of-the-art results on two public datasets.
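The abstract describes the architecture only at a high level; as a rough illustration of the core idea, the sketch below shows how caption words might be weakly grounded to visual regions to form cross-modal meta concept nodes, and how a graph could then be built dynamically over those nodes. This is a minimal, hypothetical sketch in PyTorch, not the authors' implementation: every function name, the similarity-based top-k adjacency, and all dimensions are assumptions.

```python
# Hypothetical sketch of the cross-modal meta concept idea from the abstract:
# caption words are softly grounded to visual regions via attention, the
# grounded (region, word) pairs become meta concept nodes, and a graph is
# built dynamically over those nodes by feature similarity. NOT the authors'
# released code; all names and dimensions are assumptions.
import torch
import torch.nn.functional as F

def ground_words_to_regions(regions, words):
    """Softly ground each word to visual regions (weak-grounding idea).

    regions: (N, D) region features; words: (M, D) word embeddings.
    Returns (M, D): one attended visual feature per word.
    """
    attn = torch.softmax(words @ regions.T / regions.shape[-1] ** 0.5, dim=-1)
    return attn @ regions

def build_meta_concept_graph(node_feats, top_k=3):
    """Dynamically build an adjacency over meta concept nodes by similarity."""
    normed = F.normalize(node_feats, dim=-1)
    sim = normed @ normed.T
    # keep only the top-k most similar neighbours per node (sparse dynamic graph)
    topk = sim.topk(top_k, dim=-1).indices
    return torch.zeros_like(sim).scatter_(-1, topk, 1.0)

def message_passing(node_feats, adj, weight):
    """One round of mean-aggregated, GCN-style message passing."""
    deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
    return torch.relu((adj @ node_feats) / deg @ weight)

# toy usage with random features
regions = torch.randn(10, 256)                    # 10 detected regions (assumed)
words = torch.randn(6, 256)                       # 6 caption words (assumed dim)
nodes = ground_words_to_regions(regions, words)   # meta concept nodes
adj = build_meta_concept_graph(nodes)             # dynamic concept graph
weight = torch.randn(256, 256) * 0.02
out = message_passing(nodes, adj, weight)         # refined node features
```

In the paper, the grounding is learned weakly from paired video-caption data and the video graphs also incorporate predicted predicates; here plain scaled dot-product attention and cosine similarity stand in for both.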

