IEEE Trans Pattern Anal Mach Intell. 2017 Aug;39(8):1617-1632. doi: 10.1109/TPAMI.2016.2608901. Epub 2016 Sep 13.
Pooling plays an important role in generating a discriminative video representation. In this paper, we propose a new semantic pooling approach for challenging event analysis tasks (e.g., event detection, recognition, and recounting) in long untrimmed Internet videos, especially when only a few shots/segments are relevant to the event of interest while many other shots are irrelevant or even misleading. Commonly adopted pooling strategies aggregate the shots indiscriminately in one way or another, resulting in a great loss of information. Instead, in this work we first define a novel notion of semantic saliency that assesses the relevance of each shot to the event of interest. We then prioritize the shots according to their saliency scores, since shots that are semantically more salient are expected to contribute more to the final event analysis. Next, we propose a new isotonic regularizer that exploits the constructed semantic ordering information. The resulting nearly-isotonic support vector machine classifier exhibits higher discriminative power in event analysis tasks. Computationally, we develop an efficient implementation using the proximal gradient algorithm, and we derive new closed-form proximal steps. We conduct extensive experiments on three real-world video datasets and achieve promising improvements.
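The ordering idea in the abstract can be illustrated with a small sketch. It is not the paper's formulation: the function names (`rank_by_saliency`, `nearly_isotonic_penalty`, `weighted_pool`), the sign convention of the penalty, and the use of per-shot weights for a weighted average are all assumptions made for illustration. The sketch ranks shots by a given saliency score, measures how far a set of per-shot weights violates that ranking (zero when more salient shots never receive smaller weights), and pools per-shot features with those weights.

```python
from typing import List, Sequence


def rank_by_saliency(saliency: Sequence[float]) -> List[int]:
    """Return shot indices sorted from most to least salient (hypothetical helper)."""
    return sorted(range(len(saliency)), key=lambda i: -saliency[i])


def nearly_isotonic_penalty(weights: Sequence[float], order: Sequence[int]) -> float:
    """Sum the positive parts of weight increases along the saliency ordering.

    Assumed convention: the penalty is zero when weights never increase as
    saliency decreases, and grows with each violation of that ordering.
    """
    ordered = [weights[i] for i in order]
    return sum(max(nxt - cur, 0.0) for cur, nxt in zip(ordered, ordered[1:]))


def weighted_pool(features: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Weighted average of per-shot feature vectors (assumed pooling form)."""
    total = sum(weights)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)]


# Three shots with made-up saliency scores and 2-D features.
saliency = [0.1, 0.9, 0.5]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order = rank_by_saliency(saliency)            # most salient shot first

consistent = [0.2, 0.5, 0.3]   # weights follow the saliency ranking
violating = [0.5, 0.2, 0.3]    # least salient shot gets the largest weight

penalty_ok = nearly_isotonic_penalty(consistent, order)
penalty_bad = nearly_isotonic_penalty(violating, order)
pooled = weighted_pool(features, consistent)
```

A hinge-style penalty of this kind only discourages ordering violations rather than forbidding them, which is the sense in which such a constraint is "nearly" isotonic. In a full classifier the penalty would be added to a training objective and handled by a proximal step, which this sketch does not attempt.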