Ji Zhong, Zhao Yuxiao, Pang Yanwei, Li Xi, Han Jungong
IEEE Trans Neural Netw Learn Syst. 2021 Apr;32(4):1765-1775. doi: 10.1109/TNNLS.2020.2991083. Epub 2021 Apr 2.
This article studies supervised video summarization by formulating it as a sequence-to-sequence learning framework, in which the input and output are sequences of original video frames and their predicted importance scores, respectively. Two critical issues are addressed in this article: short-term contextual attention insufficiency and distribution inconsistency. The former lies in the failure to sufficiently capture short-term contextual attention information within the video sequence itself, since existing approaches focus heavily on long-term encoder-decoder attention. The latter refers to the inconsistency between the distributions of the predicted importance-score sequence and the ground-truth sequence, which may lead to a suboptimal solution. To mitigate the first issue, we incorporate a self-attention mechanism in the encoder to highlight the important keyframes in a short-term context. The proposed mechanism, alongside the encoder-decoder attention, constitutes our deep attentive model for video summarization. For the second issue, we propose a distribution consistency learning method that employs a simple yet effective regularization loss term, which seeks a consistent distribution for the two sequences. Our final approach is dubbed Attentive and Distribution consistent video Summarization (ADSum). Extensive experiments on benchmark data sets demonstrate the superiority of the proposed ADSum approach over state-of-the-art approaches.
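To make the two ideas concrete, the following is a minimal NumPy sketch of (a) a single-head self-attention step that re-weights frame features by their short-term context, and (b) one plausible instantiation of a distribution-consistency regularizer as a KL divergence between the softmax-normalized predicted and ground-truth score sequences. The function names, the choice of softmax normalization, and the KL form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(frame_feats):
    """Single-head scaled dot-product self-attention over frame features.

    frame_feats: (T, d) array, one feature vector per frame.
    Returns context-weighted features of the same shape. Here queries,
    keys, and values all equal the input (no learned projections) --
    an illustrative simplification of the encoder self-attention.
    """
    T, d = frame_feats.shape
    scores = frame_feats @ frame_feats.T / np.sqrt(d)  # (T, T) affinities
    weights = softmax(scores, axis=-1)                  # rows sum to 1
    return weights @ frame_feats

def distribution_consistency_loss(pred_scores, gt_scores):
    """KL divergence between the two score sequences viewed as distributions.

    Both sequences are softmax-normalized so they sum to 1; the loss is
    zero iff the normalized distributions match. This is one simple way
    to realize a distribution-consistency regularizer (an assumption).
    """
    p = softmax(np.asarray(gt_scores, dtype=float))
    q = softmax(np.asarray(pred_scores, dtype=float))
    return float(np.sum(p * np.log(p / q)))
```

In training, such a regularizer would be added to the usual per-frame regression loss (e.g. MSE on importance scores), so the model is penalized both for per-frame errors and for a mismatched overall score distribution.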