Wang Jue, Cherian Anoop
IEEE Trans Pattern Anal Mach Intell. 2021 Feb;43(2):420-433. doi: 10.1109/TPAMI.2019.2937292. Epub 2021 Jan 8.
Most popular deep models for action recognition in videos generate independent predictions for short clips, which are then pooled heuristically to assign an action label to the full video segment. As not all frames may characterize the underlying action (indeed, many are common across multiple actions), pooling schemes that impose equal importance on all frames might be unfavorable. To tackle this problem, we propose discriminative pooling, based on the notion that among the deep features generated on all short clips, there is at least one that characterizes the action. To identify these useful features, we resort to a negative bag consisting of features known to be irrelevant; for example, features sampled from datasets unrelated to our actions of interest, or CNN features produced with random noise as input. With the features from the video as a positive bag and the irrelevant features as the negative bag, we formulate an objective to learn a (nonlinear) hyperplane that separates the unknown useful features from the rest, cast as a multiple instance learning problem within a support vector machine setup. We use the parameters of this separating hyperplane as a descriptor for the full video segment. Since these parameters are directly related to the support vectors in a max-margin framework, they can be treated as a weighted average pooling of the features from the bags, with zero weights given to non-support vectors. Our pooling scheme is end-to-end trainable within a deep learning framework. We report results from experiments on eight computer vision benchmark datasets spanning a variety of video-related tasks and demonstrate state-of-the-art performance across these tasks.
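To make the pooling step concrete, below is a minimal sketch of the idea using scikit-learn's LinearSVC in place of the paper's end-to-end trainable formulation. The feature dimensions, the Gaussian-noise negative bag, and the function name discriminative_pool are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of discriminative pooling: separate a video's
# clip features (positive bag) from irrelevant features (negative bag)
# with a linear SVM, and use the hyperplane as the video descriptor.
import numpy as np
from sklearn.svm import LinearSVC

def discriminative_pool(clip_features, n_negatives=256, C=1.0, seed=0):
    """Pool per-clip CNN features into a single video descriptor.

    clip_features: (n_clips, d) array forming the positive bag.
    Returns the (d,) normal vector of the separating hyperplane.
    """
    rng = np.random.default_rng(seed)
    d = clip_features.shape[1]
    # Negative bag: features known to be irrelevant. Plain Gaussian
    # noise stands in here for noise-driven CNN activations.
    negatives = rng.standard_normal((n_negatives, d)).astype(np.float32)

    X = np.vstack([clip_features, negatives])
    y = np.concatenate([np.ones(len(clip_features)), -np.ones(n_negatives)])

    svm = LinearSVC(C=C).fit(X, y)
    # In a max-margin framework the hyperplane parameters are a
    # weighted combination of the bag features, with zero weight on
    # non-support vectors; they serve as the full-video descriptor.
    return svm.coef_.ravel()

# Example: 32 clips with 512-dimensional features each.
video_descriptor = discriminative_pool(
    np.random.rand(32, 512).astype(np.float32))
print(video_descriptor.shape)  # (512,)
```

Note that this sketch trains a separate SVM per video at inference time; the paper's formulation instead makes this pooling step end-to-end trainable inside the deep network.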