IEEE Trans Image Process. 2023;32:3013-3026. doi: 10.1109/TIP.2023.3275069. Epub 2023 May 26.
Video summarization aims to generate a compact summary of the original video for efficient video browsing. To produce summaries that are consistent with human perception and contain the important content, supervised learning-based video summarization methods have been proposed. These methods learn to identify important content from the continuous frame information of human-created summaries. However, few recent methods simultaneously consider both the inter-frame correlations among non-adjacent frames and the intra-frame attention that draws human interest when representing frame importance. To address these issues, we propose a novel transformer-based method named the spatiotemporal vision transformer (STVT) for video summarization. The STVT comprises three main components: an embedded sequence module, a temporal inter-frame attention (TIA) encoder, and a spatial intra-frame attention (SIA) encoder. The embedded sequence module represents each frame by fusing the frame embedding, index embedding, and segment class embedding. The TIA encoder learns the temporal inter-frame correlations among non-adjacent frames with a multi-head self-attention scheme. Then, the SIA encoder learns the spatial intra-frame attention of each frame. Finally, a multi-frame loss drives the learning of the network in an end-to-end trainable manner. By simultaneously using both inter-frame and intra-frame information, our method outperforms state-of-the-art methods on both the SumMe and TVSum datasets. The source code of the spatiotemporal vision transformer will be available at https://github.com/nchucvml/STVT.
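The embedding fusion and temporal self-attention steps described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the additive fusion of the three embeddings, the random projection weights, and the tensor sizes are all illustrative assumptions, and the paper's actual design may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, num_heads, rng):
    """Scaled dot-product multi-head self-attention over a frame sequence.

    x: (T, D) array of T frame tokens with embedding dimension D.
    Returns the attended sequence (T, D) and per-head attention maps (H, T, T),
    which relate every frame to every other frame, including non-adjacent ones.
    """
    T, D = x.shape
    d_head = D // num_heads
    # Random projections stand in for learned Q/K/V weight matrices.
    Wq, Wk, Wv = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv

    def split_heads(t):  # (T, D) -> (num_heads, T, d_head)
        return t.reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (H, T, T)
    out = (attn @ v).transpose(1, 0, 2).reshape(T, D)           # merge heads
    return out, attn


rng = np.random.default_rng(0)
T, D = 8, 64  # 8 frame tokens, 64-dim embeddings (illustrative sizes)
frame_emb = rng.standard_normal((T, D))    # stands in for CNN frame features
index_emb = rng.standard_normal((T, D))    # frame index (positional) embedding
segment_emb = rng.standard_normal((T, D))  # segment class embedding

# Additive fusion of the three embeddings (one common choice; assumed here).
tokens = frame_emb + index_emb + segment_emb
out, attn = multi_head_self_attention(tokens, num_heads=4, rng=rng)
```

Each row of `attn[h]` is a probability distribution over all T frames, so every frame can attend directly to distant, non-adjacent frames rather than only its temporal neighbours.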