Computer Vision Laboratory, ETH Zurich, Sternwartstrasse 7, CH-8092 Zurich, Switzerland.
IEEE Trans Pattern Anal Mach Intell. 2013 Apr;35(4):835-48. doi: 10.1109/TPAMI.2012.175.
We introduce an approach for learning human actions as interactions between persons and objects in realistic videos. Previous work typically represents actions with low-level features such as image gradients or optical flow. In contrast, we explicitly localize in space and track over time both the object and the person, and represent an action as the trajectory of the object w.r.t. the person position. Our approach relies on state-of-the-art techniques for human detection, object detection, and tracking. We show that this results in human and object tracks of sufficient quality to model and localize human-object interactions in realistic videos. Our human-object interaction features capture the relative trajectory of the object w.r.t. the human. Experimental results on the Coffee and Cigarettes dataset, the video dataset of, and the Rochester Daily Activities dataset show that 1) our explicit human-object model is an informative cue for action recognition; and 2) it is complementary to traditional low-level descriptors, such as 3D-HOG, extracted over human tracks. We show that combining our human-object interaction features with 3D-HOG improves over their individual performance as well as over the state of the art.
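The core representation described above, the trajectory of the object relative to the person, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `relative_trajectory`, the `[x, y, w, h]` box convention, and the normalization by person size are assumptions made here for concreteness.

```python
import numpy as np

def relative_trajectory(person_boxes, object_boxes):
    """Hypothetical sketch of a human-object interaction feature:
    the displacement of the object w.r.t. the person over time.
    Each track is a (T, 4) array of per-frame [x, y, w, h] boxes."""
    person = np.asarray(person_boxes, dtype=float)
    obj = np.asarray(object_boxes, dtype=float)
    # Box centers.
    pc = person[:, :2] + person[:, 2:4] / 2.0
    oc = obj[:, :2] + obj[:, 2:4] / 2.0
    # Object center relative to person center, normalized by the
    # person's width and height (assumed here for scale invariance).
    return (oc - pc) / person[:, 2:4]

# Toy example: an object rising toward the person's head over 3 frames,
# as one might observe in a drinking action.
person = [[100, 50, 40, 120]] * 3
obj = [[110, 150, 10, 10], [112, 110, 10, 10], [115, 70, 10, 10]]
feat = relative_trajectory(person, obj)  # shape (3, 2)
```

The resulting sequence of relative positions forms the trajectory descriptor; in this toy example the vertical component decreases frame by frame, capturing the upward motion of the object with respect to the person.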