Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Korea.
Clova AI Research, NAVER Corporation, Seongnam 13561, Korea.
Sensors (Basel). 2021 Dec 12;21(24):8309. doi: 10.3390/s21248309.
In recent years, human action recognition has been studied by many computer vision researchers. Recent studies have attempted to use two-stream networks that combine appearance and motion features, but most of these approaches focus on clip-level video action recognition. In contrast to traditional methods, which generally use entire images, we propose a new human instance-level video action recognition framework. In this framework, we represent instance-level features using human boxes and keypoints, and these action region features serve as the inputs to the temporal action head network, which makes our framework more discriminative. We also propose novel temporal action head networks consisting of various modules that capture diverse temporal dynamics. In our experiments, the proposed models achieve performance comparable to state-of-the-art approaches on two challenging datasets. Furthermore, we evaluate the proposed features and networks to verify their effectiveness. Finally, we analyze the confusion matrix and visualize the recognized actions at the human instance level in scenes containing several people.
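The abstract describes pooling per-person action region features from clip features and feeding them to a temporal head. A minimal NumPy sketch of that idea is below; the function names, the mean-pool ROI, and the average-pooling temporal head are illustrative assumptions, not the authors' actual architecture:

```python
import numpy as np

def roi_pool(frame_feat, box):
    """Pool one person's box region from a frame feature map.

    frame_feat: (H, W, C) feature map; box: (x1, y1, x2, y2) in
    feature-map coordinates. Returns a (C,) instance-level vector.
    """
    x1, y1, x2, y2 = box
    return frame_feat[y1:y2, x1:x2, :].mean(axis=(0, 1))

def instance_action_scores(clip_feat, boxes, w, b):
    """Score one person's actions from their per-frame box features.

    clip_feat: (T, H, W, C) clip features; boxes: one box per frame
    for a single tracked person; w: (C, num_actions), b: (num_actions,).
    The temporal head here is a simple average over time followed by a
    linear classifier (a stand-in for the paper's temporal modules).
    """
    per_frame = np.stack([roi_pool(f, bx) for f, bx in zip(clip_feat, boxes)])  # (T, C)
    temporal = per_frame.mean(axis=0)  # temporal pooling -> (C,)
    return temporal @ w + b            # (num_actions,) logits

# Usage: one 8-frame clip, one person, 5 hypothetical action classes.
rng = np.random.default_rng(0)
clip = rng.normal(size=(8, 16, 16, 32))
boxes = [(2, 2, 10, 10)] * 8
w, b = rng.normal(size=(32, 5)), np.zeros(5)
scores = instance_action_scores(clip, boxes, w, b)
```

Because pooling happens per box, each detected person gets their own score vector, which is what makes the recognition instance-level rather than clip-level.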