Query-biased Self-attentive Network for Query-focused Video Summarization.

Author information

Xiao Shuwen, Zhao Zhou, Zhang Zijian, Guan Ziyu, Cai Deng

Publication information

IEEE Trans Image Process. 2020 Apr 13. doi: 10.1109/TIP.2020.2985868.

Abstract

This paper addresses the task of query-focused video summarization, which takes user queries and long videos as inputs and generates query-focused video summaries. Whereas generic video summarization mainly concentrates on finding the most diverse and representative visual content as a summary, query-focused video summarization also considers the user's intent and the semantic meaning of the generated summary. We propose a method named the query-biased self-attentive network (QSAN) to tackle this challenge. Our key idea is to use the semantic information from video descriptions to generate a generic summary and then to combine it with information from the query to generate a query-focused summary. Specifically, we first propose a hierarchical self-attentive network that models relative relationships at three levels: among different frames of a segment, among different segments of the same video, and between the textual information of the video description and its related visual content. We train the model on a video caption dataset and employ a reinforced caption generator to generate a video description, which helps locate important frames or shots. We then build a query-aware scoring module that computes a query-relevance score for each shot and generates the query-focused summary. Extensive experiments on the benchmark dataset demonstrate that our approach performs competitively with other methods.
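The two ideas named in the abstract, self-attention over shots followed by query-aware scoring, can be illustrated with a toy sketch. This is not the QSAN implementation: the abstract gives no architectural details, so the code below assumes identity projections in the attention step (no learned weights), cosine similarity as the query-relevance score, and top-k selection of shots. The function names `self_attend` and `query_focused_summary` and the feature vectors are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attend(features):
    """Scaled dot-product self-attention with identity projections.

    Assumption: a trained model would apply learned query/key/value
    projections; they are omitted here to keep the sketch minimal.
    """
    d = len(features[0])
    contextual = []
    for q in features:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in features])
        contextual.append(
            [sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)]
        )
    return contextual


def cosine(a, b):
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def query_focused_summary(shot_features, query_embedding, k):
    """Score each shot against the query and keep the k best, in time order.

    Assumption: cosine similarity stands in for the paper's
    query-aware scoring module, whose form the abstract does not specify.
    """
    contextual = self_attend(shot_features)
    scores = [cosine(s, query_embedding) for s in contextual]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


# Hypothetical 2-D shot features and query embedding.
shots = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
selected, scores = query_focused_summary(shots, query, k=2)
print(selected)
```

In this toy input, the first and third shots point in roughly the same direction as the query, so they receive the highest scores and form the summary. Returning the selected indices in temporal order keeps the summary playable as a video.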
