IEEE Trans Cybern. 2019 May;49(5):1616-1628. doi: 10.1109/TCYB.2018.2806381. Epub 2018 Feb 27.
Desktop action recognition from first-person (egocentric) video is an important task, both because desktop activities are omnipresent in daily life and because the first-person viewpoint is ideal for observing hand-object interactions. However, no previous research effort has been dedicated to benchmarking this task. In this paper, we release a dataset of daily desktop actions recorded with a wearable camera and publish it as a benchmark for desktop action recognition. Regular desktop activities of six participants were recorded as egocentric video with a wide-angle head-mounted camera. In particular, we focus on five common desktop actions that involve the hands. We provide the original video data, frame-level action annotations, and pixel-level hand masks. We also propose a feature representation that characterizes different desktop actions based on the spatial and temporal information of the hands. In experiments, we report statistics of the dataset and evaluate the action recognition performance of different features as a baseline. The proposed method achieves promising performance on the five action classes.
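To make the idea of a hand-based spatio-temporal feature concrete, here is a minimal sketch, not the authors' actual representation: given the pixel-level hand masks the dataset provides, it stacks per-frame hand centroids and areas (spatial information) together with frame-to-frame centroid displacements (temporal information) into a vector that could feed a standard classifier. The function name `hand_features`, the mask array layout, and the synthetic demo data are all assumptions for illustration.

```python
# Sketch of a spatio-temporal hand feature from per-frame binary hand masks.
# Assumptions: masks arrive as a (T, H, W) binary array for a fixed-length
# clip; the paper's actual feature representation may differ.
import numpy as np

def hand_features(masks):
    """masks: (T, H, W) binary array of per-frame hand masks.
    Returns a flat feature vector with, per frame: normalized centroid
    (cx, cy), hand area ratio, and centroid displacement (dx, dy)."""
    T, H, W = masks.shape
    feats = []
    prev = None
    for t in range(T):
        ys, xs = np.nonzero(masks[t])
        if len(xs) == 0:                        # no hand visible this frame
            cx, cy, area = 0.5, 0.5, 0.0
        else:
            cx, cy = xs.mean() / W, ys.mean() / H  # spatial: where the hand is
            area = len(xs) / float(H * W)          # spatial: how much hand is visible
        dx, dy = (0.0, 0.0) if prev is None else (cx - prev[0], cy - prev[1])
        feats.append([cx, cy, area, dx, dy])       # temporal: how the hand moves
        prev = (cx, cy)
    return np.asarray(feats).ravel()               # shape: (5 * T,)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = (rng.random((30, 64, 64)) > 0.8).astype(np.uint8)  # synthetic masks
    print(hand_features(demo).shape)  # (150,) = 30 frames x 5 values
```

A fixed-length vector like this can be passed directly to an off-the-shelf classifier (e.g., an SVM) to obtain a baseline comparable in spirit to the feature evaluation described in the abstract.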