Do Minh N
IEEE Trans Image Process. 2016 Nov;25(11):5479-5490. doi: 10.1109/TIP.2016.2605305. Epub 2016 Sep 1.
We focus on the problem of still image-based human action recognition, which essentially involves making predictions by analyzing human poses and their interactions with objects in the scene. Besides image-level action labels (e.g., riding, phoning), existing works usually require human bounding boxes as additional input during both training and testing to facilitate characterizing the underlying human-object interactions. We argue that this additional input requirement may severely limit potential applications and is largely unnecessary. To this end, we develop in this paper a systematic approach to address this challenging problem with minimum annotation effort, i.e., to perform recognition with only image-level action labels in the training stage. Experimental results on three benchmark data sets demonstrate that, compared with state-of-the-art methods that have privileged access to additional human bounding-box annotations, our approach achieves comparable or even superior recognition accuracy using only action annotations in training. Interestingly, as a by-product, in many cases our approach is able to segment out the precise regions of the underlying human-object interactions.