Department of Applied and Cognitive Informatics, Graduate School of Science and Engineering, Chiba University, Chiba 263-8522, Japan.
Graduate School of Engineering, Chiba University, Chiba 263-8522, Japan.
Sensors (Basel). 2021 May 30;21(11):3793. doi: 10.3390/s21113793.
Large datasets are often used to improve the accuracy of action recognition, but annotating very large datasets is labor-intensive. This has encouraged research in zero-shot action recognition (ZSAR). Most existing ZSAR methods recognize actions from individual video frames; they are therefore sensitive to lighting, camera angle, and background, and most cannot process time-series data, which reduces model accuracy. In this paper, to address these problems, we propose a three-stream graph convolutional network that processes both types of data. Our model has two parts: one processes RGB data, which contains extensive useful information, and the other processes skeleton data, which is unaffected by lighting and background. By combining the two outputs with a weighted sum, the model predicts the final ZSAR results. Experiments on three datasets demonstrate that our model achieves higher accuracy than a baseline model. Moreover, we show that the model can learn from human experience, which further improves its accuracy.
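The abstract describes combining the RGB-stream and skeleton-stream outputs with a weighted sum to produce the final prediction. A minimal late-fusion sketch of that idea is below; the weight values and per-class scores are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def fuse_scores(rgb_scores, skeleton_scores, w_rgb=0.6, w_skel=0.4):
    """Late fusion by weighted sum of per-class scores from two streams.

    The weights (0.6 / 0.4) are placeholders for illustration; the paper's
    actual fusion weights are not stated in the abstract.
    Returns the index of the predicted class and the fused score vector.
    """
    rgb = np.asarray(rgb_scores, dtype=float)
    skel = np.asarray(skeleton_scores, dtype=float)
    fused = w_rgb * rgb + w_skel * skel
    return int(np.argmax(fused)), fused

# Hypothetical per-class scores for three candidate (unseen) action classes.
pred, fused = fuse_scores([0.2, 0.5, 0.3], [0.1, 0.2, 0.7])
# The skeleton stream's strong vote for class 2 outweighs the RGB stream here.
```

Weighting the streams rather than averaging them lets the model favor the modality that is more reliable for a given setting (e.g., the skeleton stream under difficult lighting).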