Department of Computer Science, Yonsei University, Seoul 03722, Korea.
Sensors (Basel). 2019 Mar 20;19(6):1382. doi: 10.3390/s19061382.
In action recognition research, the two primary types of information are appearance and motion, both learned from RGB images captured by visual sensors. However, depending on the characteristics of an action, contextual information, such as the presence of specific objects or globally shared information in the image, becomes vital for defining the action. For example, the presence of a ball is the key cue distinguishing "kicking" from "running". Furthermore, some actions share typical global abstract poses, which can serve as a key for classifying actions. Based on these observations, we propose a multi-stream network model that incorporates spatial, temporal, and contextual cues in the image for action recognition. We evaluated the proposed method using C3D or inflated 3D ConvNet (I3D) as a backbone network on two different action recognition datasets. As a result, we observed an overall improvement in accuracy, demonstrating the effectiveness of the proposed method.
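The abstract describes combining spatial, temporal, and contextual streams, but does not specify how the streams are fused. The sketch below is a minimal, hypothetical late-fusion example, assuming each stream produces per-class logits and that fusion is a simple average of softmax scores (an assumption, not the paper's stated method):

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_streams(stream_logits: List[List[float]]) -> List[float]:
    """Hypothetical late fusion: average per-class softmax scores
    across the spatial, temporal, and contextual streams."""
    probs = [softmax(logits) for logits in stream_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]

# Toy logits for three streams over four action classes
# (values are illustrative, not from the paper).
spatial    = [2.0, 0.5, 0.1, -1.0]
temporal   = [1.5, 1.0, 0.0, -0.5]
contextual = [0.5, 2.5, -0.2, 0.0]

fused = fuse_streams([spatial, temporal, contextual])
predicted = max(range(len(fused)), key=lambda c: fused[c])
```

This illustrates how a contextual stream (e.g., one sensitive to the presence of a ball) can shift the fused score toward one action class even when the appearance and motion streams alone are ambiguous.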