Video Captioning with Object-Aware Spatio-Temporal Correlation and Aggregation

Author information

Zhang Junchao, Peng Yuxin

Publication information

IEEE Trans Image Process. 2020 Apr 27. doi: 10.1109/TIP.2020.2988435.

Abstract

Video captioning is a significant and challenging task in computer vision and natural language processing that aims to describe video content automatically in natural language sentences. Comprehensive understanding of the video is the key to accurate captioning: a model must not only capture the global content and salient objects in the video, but also understand the spatio-temporal relations of objects, including their temporal trajectories and spatial relationships. It is therefore important for video captioning to capture objects' relationships both within and across frames. In this paper, we propose an object-aware spatio-temporal graph (OSTG) approach for video captioning. It constructs spatio-temporal graphs to depict objects and their relations, where the temporal graphs represent objects' inter-frame dynamics and the spatial graphs represent objects' intra-frame interactive relationships. The main novelties and advantages are: (1) Bidirectional temporal alignment: A bidirectional temporal graph is constructed both along and against the temporal order to align objects across different frames in both directions, which provides complementary clues for capturing the inter-frame temporal trajectory of each salient object. (2) Graph-based spatial relation learning: A spatial relation graph is constructed among the objects in each frame from their relative spatial locations and semantic correlations, and is exploited to learn relation features that encode intra-frame relationships among salient objects. (3) Object-aware feature aggregation: Trainable VLAD (vector of locally aggregated descriptors) models perform object-aware feature aggregation on objects' local features, learning discriminative aggregated representations for better video captioning. A hierarchical attention mechanism is also developed to distinguish the contributions of different object instances. Experiments on two widely used datasets, MSR-VTT and MSVD, demonstrate that the proposed approach achieves state-of-the-art performance in terms of the BLEU@4, METEOR and CIDEr metrics.
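The abstract names trainable VLAD aggregation but does not give its formulation. As an illustration only, the sketch below shows the commonly used soft-assignment form of VLAD: each local feature is softly assigned to K cluster centers, and the weighted residuals are summed per cluster and normalized. The function names, the fixed `centers`, `weights`, and `biases`, and the toy inputs are all assumptions for this example; in a trainable layer these parameters would be learned, and the paper's actual model may differ.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec: Sequence[float], eps: float = 1e-12) -> Vector:
    """Scale a vector to unit L2 norm (an all-zero vector stays zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


def vlad_aggregate(
    features: Sequence[Sequence[float]],
    centers: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> Vector:
    """Aggregate N local D-dim features into one (K * D)-dim descriptor.

    centers, weights, and biases are hypothetical fixed parameters here;
    a trainable VLAD layer would learn them by backpropagation.
    """
    num_clusters, dim = len(centers), len(centers[0])
    residual_sums = [[0.0] * dim for _ in range(num_clusters)]
    for x in features:
        # Soft assignment of this feature to every cluster.
        scores = [
            sum(w_d * x_d for w_d, x_d in zip(w, x)) + b
            for w, b in zip(weights, biases)
        ]
        assign = softmax(scores)
        # Accumulate assignment-weighted residuals to each center.
        for k in range(num_clusters):
            for d in range(dim):
                residual_sums[k][d] += assign[k] * (x[d] - centers[k][d])
    # Normalize each cluster's residual, flatten, then normalize the whole vector.
    flat = [v for row in residual_sums for v in l2_normalize(row)]
    return l2_normalize(flat)


# Toy input: three 2-D "object features" and two clusters (illustrative values).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
centers = [[1.0, 0.0], [0.0, 1.0]]
weights = [[10.0, 0.0], [0.0, 10.0]]
biases = [0.0, 0.0]
descriptor = vlad_aggregate(features, centers, weights, biases)
print(len(descriptor))  # K * D = 4
```

The output length depends only on the number of clusters and the feature dimension, not on how many local features are supplied, which is what makes this kind of aggregation convenient for a variable number of object regions.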

Similar articles

Long Short-Term Relation Transformer With Global Gating for Video Captioning.

IEEE Trans Image Process. 2022;31:2726-2738. doi: 10.1109/TIP.2022.3158546. Epub 2022 Mar 29.

Video Captioning Using Global-Local Representation.

IEEE Trans Circuits Syst Video Technol. 2022 Oct;32(10):6642-6656. doi: 10.1109/tcsvt.2022.3177320. Epub 2022 May 23.

Cross-Modal Graph With Meta Concepts for Video Captioning.

IEEE Trans Image Process. 2022;31:5150-5162. doi: 10.1109/TIP.2022.3192709. Epub 2022 Aug 2.

Research on Video Captioning Based on Multifeature Fusion.

Comput Intell Neurosci. 2022 Apr 28;2022:1204909. doi: 10.1155/2022/1204909. eCollection 2022.

Video captioning based on vision transformer and reinforcement learning.

PeerJ Comput Sci. 2022 Mar 16;8:e916. doi: 10.7717/peerj-cs.916. eCollection 2022.

Visual Commonsense-Aware Representation Network for Video Captioning.

IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):1092-1103. doi: 10.1109/TNNLS.2023.3323491. Epub 2025 Jan 7.

Adaptive Spatio-Temporal Graph Enhanced Vision-Language Representation for Video QA.

IEEE Trans Image Process. 2021;30:5477-5489. doi: 10.1109/TIP.2021.3076556. Epub 2021 Jun 11.

Relational Reasoning Over Spatial-Temporal Graphs for Video Summarization.

IEEE Trans Image Process. 2022;31:3017-3031. doi: 10.1109/TIP.2022.3163855. Epub 2022 Apr 11.

Adversarial Reinforcement Learning With Object-Scene Relational Graph for Video Captioning.

IEEE Trans Image Process. 2022;31:2004-2016. doi: 10.1109/TIP.2022.3148868. Epub 2022 Feb 25.

Cited by

Video Captioning Using Global-Local Representation.

IEEE Trans Circuits Syst Video Technol. 2022 Oct;32(10):6642-6656. doi: 10.1109/tcsvt.2022.3177320. Epub 2022 May 23.

Design of Neural Network Model for Cross-Media Audio and Video Score Recognition Based on Convolutional Neural Network Model.

Comput Intell Neurosci. 2022 Jun 13;2022:4626867. doi: 10.1155/2022/4626867. eCollection 2022.

Video captioning based on vision transformer and reinforcement learning.

PMID:32356746
