Chen Huizhou, Li Yunan, Fang Huijuan, Xin Wentian, Lu Zixiang, Miao Qiguang
School of Computer Science and Technology, Xidian University, Xi'an 710071, China.
Xiaomi Communications, Beijing 100085, China.
Sensors (Basel). 2022 Mar 21;22(6):2405. doi: 10.3390/s22062405.
Gesture recognition is an important direction in computer vision research, and information from the hands is crucial in this task. However, current methods typically direct attention to hand regions based on estimated keypoints, which significantly increases both time and complexity, and may lose hand position information when keypoints are estimated incorrectly. Moreover, for dynamic gesture recognition, attention in the spatial dimension alone is not sufficient. This paper proposes a multi-scale attention 3D convolutional network for gesture recognition that fuses multimodal data. The proposed network applies attention mechanisms both locally and globally. The local attention leverages hand information extracted by a hand detector to focus on the hand region and reduce interference from gesture-irrelevant factors. Global attention is achieved in both the human-posture context and the channel context through a dual spatiotemporal attention module. Furthermore, to make full use of the differences between data modalities, we design a multimodal fusion scheme to fuse the features of RGB and depth data. The proposed method is evaluated on the Chalearn LAP Isolated Gesture Dataset and the Briareo Dataset. Experiments on these two datasets demonstrate the effectiveness of our network and show that it outperforms many state-of-the-art methods.
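Since the abstract only describes the architecture at a high level, the following is a minimal illustrative sketch (not the authors' implementation) of the kind of two-stream design it outlines: a shallow 3D-CNN per modality, a dual attention block acting on the channel and spatiotemporal contexts, and late fusion of RGB and depth features. All module names, layer widths, the concatenation-based fusion, and the 249-class output (assumed to match Chalearn IsoGD) are assumptions for illustration only.

```python
# Minimal sketch, assuming a PyTorch two-stream setup; not the paper's network.
import torch
import torch.nn as nn


class DualSpatiotemporalAttention(nn.Module):
    """Re-weights a 3D feature map along channel and spatiotemporal contexts."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel context: squeeze T/H/W, excite channels (SE-style gate).
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatiotemporal context: collapse channels into a T x H x W attention map.
        self.spatial_gate = nn.Sequential(
            nn.Conv3d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)   # channel attention
        x = x * self.spatial_gate(x)   # spatiotemporal attention
        return x


class TwoStreamGestureNet(nn.Module):
    """One shallow 3D-CNN stream per modality, attention, then concatenation fusion."""

    def __init__(self, num_classes: int = 249):
        super().__init__()

        def stream(in_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv3d(in_ch, 32, kernel_size=3, padding=1),
                nn.BatchNorm3d(32),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(2),
                DualSpatiotemporalAttention(32),
                nn.AdaptiveAvgPool3d(1),
                nn.Flatten(),
            )

        self.rgb_stream = stream(3)    # RGB clip:   (B, 3, T, H, W)
        self.depth_stream = stream(1)  # depth clip: (B, 1, T, H, W)
        self.classifier = nn.Linear(32 + 32, num_classes)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.rgb_stream(rgb), self.depth_stream(depth)], dim=1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = TwoStreamGestureNet()
    rgb = torch.randn(2, 3, 16, 112, 112)
    depth = torch.randn(2, 1, 16, 112, 112)
    print(model(rgb, depth).shape)  # torch.Size([2, 249])
```

The local hand-region attention described in the abstract (cropping or masking around detector-provided hand boxes before the 3D convolutions) is omitted here for brevity; the sketch only illustrates the global dual attention and the RGB-depth fusion.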