GRAM, Department of Signal Theory and Communications, University of Alcalá, 28805 Alcalá de Henares, Spain.
Sensors (Basel). 2020 May 22;20(10):2953. doi: 10.3390/s20102953.
In this work, we introduce an intelligent video sensor for the problem of Action Proposals (AP). AP consists of localizing the temporal segments of untrimmed videos that are likely to contain actions. Solving this problem can accelerate several video action understanding tasks, such as detection, retrieval, or indexing. All previous AP approaches are supervised and offline, i.e., they need both the temporal annotations of the datasets during training and access to the whole video to cast their proposals. We propose here a new approach that, unlike the rest of the state-of-the-art models, is unsupervised: it never sees any labeled data during learning, nor does it use any features pre-trained on the dataset at hand. Moreover, our approach operates in an online manner, which is beneficial for many real-world applications where the video has to be processed as soon as it arrives at the sensor, e.g., robotics or video monitoring. The core of our method is a Support Vector Classifier (SVC) module that produces candidate AP segments by distinguishing between sets of contiguous video frames. We further propose a mechanism to refine and filter those candidate segments; this filter optimizes a learning-to-rank formulation over the dynamics of the segments. An extensive experimental evaluation is conducted on the Thumos'14 and ActivityNet datasets and, to the best of our knowledge, this work constitutes the first unsupervised approach evaluated on these main AP benchmarks. Finally, we also provide a thorough comparison with the current state-of-the-art supervised AP approaches. We achieve 41% and 59% of the performance of the best supervised model on ActivityNet and Thumos'14, respectively, confirming our unsupervised solution as a viable option for tackling the AP problem. The code to reproduce all our results will be publicly released upon acceptance of the paper.
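The abstract describes the SVC-based candidate generation only at a high level. The snippet below is a minimal, hypothetical sketch (not the authors' implementation) of the underlying idea: fit a classifier to separate a "past" window of contiguous frames from an "incoming" window, and flag a candidate segment boundary whenever the two sets are easily separable. The per-frame features, the window size win, the separability threshold thresh, and the helper propose_boundaries are assumptions introduced here for illustration only.

    # Hedged sketch, not the paper's code: SVC separating two sets of
    # contiguous frames as a cue for candidate segment boundaries.
    import numpy as np
    from sklearn.svm import SVC

    def propose_boundaries(frame_feats, win=16, thresh=0.9):
        """Scan per-frame features in temporal order and return time indices
        where an SVC cleanly separates the 'past' window from the 'incoming'
        window of contiguous frames (a proxy for a content change)."""
        boundaries = []
        for t in range(win, len(frame_feats) - win):
            past = frame_feats[t - win:t]        # frames already observed
            incoming = frame_feats[t:t + win]    # newly arrived frames
            X = np.vstack([past, incoming])
            y = np.array([0] * win + [1] * win)
            clf = SVC(kernel="linear", C=1.0)
            clf.fit(X, y)
            # High training accuracy means the two frame sets are easily
            # separable, suggesting a candidate AP segment boundary at t.
            if clf.score(X, y) >= thresh:
                boundaries.append(t)
        return boundaries

    # Toy usage: random 64-d features with an abrupt change at frame 100.
    feats = np.vstack([np.random.randn(100, 64),
                       np.random.randn(100, 64) + 3.0])
    print(propose_boundaries(feats)[:5])

In practice, the refinement and filtering stage described in the abstract (the learning-to-rank formulation over segment dynamics) would operate on the candidate segments delimited by such boundaries; it is not sketched here.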