Zhang Junchao, Peng Yuxin
IEEE Trans Image Process. 2020 Apr 27. doi: 10.1109/TIP.2020.2988435.
Video captioning is a significant and challenging task in computer vision and natural language processing that aims to describe video content automatically in natural language sentences. Comprehensive understanding of the video is the key to accurate captioning: it requires not only capturing the global content and salient objects in the video, but also understanding the spatio-temporal relations of objects, including their temporal trajectories and spatial relationships. It is therefore important for video captioning to capture the objects' relationships both within and across frames. In this paper, we propose an object-aware spatio-temporal graph (OSTG) approach for video captioning. It constructs spatio-temporal graphs to depict objects and their relations, where the temporal graphs represent the objects' inter-frame dynamics and the spatial graphs represent the objects' intra-frame interactive relationships. The main novelties and advantages are: (1) Bidirectional temporal alignment: A bidirectional temporal graph is constructed both along and against the temporal order to align objects across different frames in both directions, which provides complementary clues for capturing the inter-frame temporal trajectory of each salient object. (2) Graph-based spatial relation learning: A spatial relation graph is constructed among the objects in each frame by considering their relative spatial locations and semantic correlations, and is exploited to learn relation features that encode the intra-frame relationships of salient objects. (3) Object-aware feature aggregation: Trainable VLAD (vector of locally aggregated descriptors) models are deployed to perform object-aware feature aggregation on the objects' local features, learning discriminative aggregated representations for better video captioning. A hierarchical attention mechanism is also developed to distinguish the contributions of different object instances.
Experiments on two widely used datasets, MSR-VTT and MSVD, demonstrate that the proposed approach achieves state-of-the-art performance in terms of the BLEU@4, METEOR and CIDEr metrics.
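
The object-aware feature aggregation in point (3) can be illustrated with a small sketch. The abstract does not give the exact formulation, so the code below assumes a common NetVLAD-style soft assignment: each local feature is softly assigned to a set of cluster centers, and the weighted residuals are summed and normalized. The function name `soft_vlad`, the array sizes, and the random parameters are hypothetical and only stand in for learned values.

```python
import numpy as np


def soft_vlad(features, centers, weights, biases, eps=1e-12):
    """Aggregate N local descriptors of shape (N, D) into one (K * D,) vector.

    Assumed NetVLAD-style formulation, not necessarily the paper's own.
    In a trainable model, `centers`, `weights`, and `biases` would be
    learned parameters; here they are passed in as plain arrays.
    """
    # Soft-assignment logits of every descriptor to every cluster: (N, K).
    logits = features @ weights.T + biases
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)  # softmax over clusters

    # Residual of each descriptor to each cluster center: (N, K, D).
    residuals = features[:, None, :] - centers[None, :, :]
    # Assignment-weighted sum of residuals per cluster: (K, D).
    vlad = (assign[:, :, None] * residuals).sum(axis=0)

    # Normalize each cluster's residual, then the flattened vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + eps
    vlad = vlad.reshape(-1)
    return vlad / (np.linalg.norm(vlad) + eps)


# Hypothetical sizes: 5 local features of dimension 8, 3 clusters.
rng = np.random.default_rng(0)
N, D, K = 5, 8, 3
features = rng.normal(size=(N, D))
centers = rng.normal(size=(K, D))
weights = rng.normal(size=(K, D))
biases = np.zeros(K)

descriptor = soft_vlad(features, centers, weights, biases)
print(descriptor.shape)  # (24,)
```

Whatever the number of local features, the output has a fixed length of K * D, which is what makes this kind of aggregation convenient as input to a caption decoder.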