IEEE Trans Image Process. 2015 Nov;24(11):4422-32. doi: 10.1109/TIP.2015.2465147. Epub 2015 Aug 5.
Action recognition in still images is a challenging problem in computer vision. To facilitate comparative evaluation independently of person detection, the standard evaluation protocol for action recognition uses an oracle person detector to obtain perfect bounding box information at both training and test time. The assumption is that, in practice, a general person detector will provide candidate bounding boxes for action recognition. In this paper, we argue that this paradigm is suboptimal and that action class labels should already be considered during the detection stage. Motivated by the observation that body pose is strongly conditioned on action class, we show that: 1) existing state-of-the-art generic person detectors are not adequate for proposing candidate bounding boxes for action classification; 2) due to limited training examples, directly training action-specific person detectors is also inadequate; and 3) using only a small number of labeled action examples, transfer learning can adapt an existing detector to propose higher-quality bounding boxes for subsequent action classification. To the best of our knowledge, we are the first to investigate transfer learning for the task of action-specific person detection in still images. We perform extensive experiments on two benchmark data sets: 1) Stanford-40 and 2) PASCAL VOC 2012. For the action detection task (i.e., both localizing the person and classifying the action performed), our approach outperforms methods based on general person detection by 5.7% mean average precision (MAP) on Stanford-40 and 2.1% MAP on PASCAL VOC 2012. Our approach also significantly outperforms the state of the art, with a MAP of 45.4% on Stanford-40 and 31.4% on PASCAL VOC 2012. We also evaluate our action detection approach on the task of action classification (i.e., recognizing actions without localizing them). For this task, our approach, without using any ground-truth person localization at test time, outperforms state-of-the-art methods on both data sets, even though those methods do use person locations.
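The action detection results above are reported as mean average precision. As a rough illustration of how such a score can be computed, the following is a minimal sketch, not the authors' evaluation code. It rests on assumptions the abstract does not state: a detection counts as correct when its box overlaps an unmatched ground-truth box of the same action class with intersection over union of at least 0.5 (the usual PASCAL VOC convention), and average precision is the plain, non-interpolated form. All function names and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[str, float, Box]       # (image_id, confidence score, box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: List[Detection],
    ground_truth: Dict[str, List[Box]],
    iou_threshold: float = 0.5,  # assumed matching criterion
) -> float:
    """Non-interpolated AP for a single action class."""
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    # Each ground-truth box may be matched by at most one detection.
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}
    tp = fp = 0
    ap = 0.0
    # Visit detections from most to least confident.
    for image_id, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_j, best_iou = -1, 0.0
        for j, gt_box in enumerate(ground_truth.get(image_id, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0 and best_iou >= iou_threshold and not matched[image_id][best_j]:
            matched[image_id][best_j] = True
            tp += 1
            # Add the precision at this rank, weighted by the recall step 1 / n_gt.
            ap += (tp / (tp + fp)) / n_gt
        else:
            fp += 1  # poor localization or duplicate of an already matched box
    return ap


def mean_average_precision(per_class_ap: Dict[str, float]) -> float:
    """Mean of the per-class AP values."""
    return sum(per_class_ap.values()) / len(per_class_ap) if per_class_ap else 0.0


# Toy data for one hypothetical action class.
gt = {"img_a": [(0, 0, 10, 10)], "img_b": [(0, 0, 10, 10)]}
dets = [
    ("img_a", 0.9, (0, 0, 10, 10)),    # correct
    ("img_a", 0.8, (20, 20, 30, 30)),  # false positive
    ("img_b", 0.7, (1, 1, 10, 10)),    # correct, IoU 0.81
]
ap_toy = average_precision(dets, gt)
```

Under this criterion a detection must get both the box and the action label right, which is why the quality of the candidate bounding boxes directly affects the detection score.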