Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China.
Beijing Research Center for Information Technology in Agriculture, Beijing 100097, China.
Comput Intell Neurosci. 2021 Nov 18;2021:5044916. doi: 10.1155/2021/5044916. eCollection 2021.
Hand gesture recognition is a challenging topic in the field of computer vision. Multimodal hand gesture recognition based on RGB-D achieves higher accuracy than recognition from RGB or depth alone; the gain clearly originates from the complementary information in the two modalities. In practice, however, multimodal data are not always easy to acquire simultaneously, while unimodal RGB or depth hand gesture data are more common. A hand gesture recognition system is therefore desirable that requires only unimodal RGB or depth data at test time while exploiting multimodal RGB-D data during training to capture the complementary information. Fortunately, methods based on multimodal training and unimodal testing have been proposed, but their unimodal feature representations and cross-modality transfer still need improvement. To this end, this paper proposes a new 3D-Ghost and Spatial Attention Inflated 3D ConvNet (3DGSAI) to extract high-quality features for each modality. The baseline of the 3DGSAI network is the Inflated 3D ConvNet (I3D), to which two main improvements are made: a 3D-Ghost module and a spatial attention mechanism. The 3D-Ghost module extracts richer features for hand gesture representation, and the spatial attention mechanism makes the network focus on the hand region. This paper also proposes an adaptive parameter for positive knowledge transfer, which ensures that transfer always flows from the stronger modality network to the weaker one. Extensive experiments on the SKIG, VIVA, and NVGesture datasets demonstrate that our method is competitive with the state of the art. In particular, it reaches 97.87% on the SKIG dataset using only RGB, which is the current best result.
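Concretely, the two modules can be pictured as follows. This is a minimal PyTorch sketch, not the authors' released code: it lifts the GhostNet Ghost module and CBAM-style spatial attention to 3D convolutions as the abstract describes; the class names, the ratio of 2, and the kernel sizes are illustrative assumptions.

```python
# Minimal sketch of a 3D Ghost module and 3D spatial attention (assumptions noted).
import torch
import torch.nn as nn

class Ghost3D(nn.Module):
    """Produce out_channels feature maps: a few 'intrinsic' maps from a regular
    3D conv, plus cheap 'ghost' maps from a depthwise 3D conv over them."""
    def __init__(self, in_channels, out_channels, ratio=2,
                 kernel_size=1, cheap_kernel=3):
        super().__init__()
        init_ch = out_channels // ratio       # intrinsic maps (ratio=2 assumed)
        ghost_ch = out_channels - init_ch     # maps generated cheaply
        self.primary = nn.Sequential(
            nn.Conv3d(in_channels, init_ch, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm3d(init_ch), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(           # depthwise: one filter group per map
            nn.Conv3d(init_ch, ghost_ch, cheap_kernel,
                      padding=cheap_kernel // 2, groups=init_ch, bias=False),
            nn.BatchNorm3d(ghost_ch), nn.ReLU(inplace=True))

    def forward(self, x):                     # x: (N, C, T, H, W)
        intrinsic = self.primary(x)
        ghost = self.cheap(intrinsic)
        return torch.cat([intrinsic, ghost], dim=1)

class SpatialAttention3D(nn.Module):
    """CBAM-style spatial attention lifted to 3D: pool over channels, predict a
    per-location weight map, and rescale the features with it."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):                     # x: (N, C, T, H, W)
        avg = x.mean(dim=1, keepdim=True)     # channel-average map
        mx, _ = x.max(dim=1, keepdim=True)    # channel-max map
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                       # emphasize hand regions

# Usage: a Ghost block followed by spatial attention on a small clip tensor.
block = nn.Sequential(Ghost3D(64, 128), SpatialAttention3D())
out = block(torch.randn(2, 64, 8, 56, 56))    # -> (2, 128, 8, 56, 56)
```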
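The adaptive positive-transfer parameter can likewise be sketched. The abstract only specifies that the weight must keep knowledge flowing from the stronger modality network to the weaker one; the per-branch strength estimates, the sigmoid gating, and the MSE feature loss below are our assumptions, not the paper's formulation.

```python
# Hedged sketch of an adaptive positive-transfer weight between modality branches.
import torch
import torch.nn.functional as F

def transfer_loss(feat_rgb, feat_depth, acc_rgb, acc_depth):
    """Pull the weaker branch's features toward the stronger branch's.

    feat_*: feature tensors from the two modality networks (same shape).
    acc_*:  running accuracies estimating each branch's current strength.
    """
    gap = acc_rgb - acc_depth
    # Adaptive weight: grows with the strength gap and vanishes when the
    # branches are equally strong, so transfer never flows strong <- weak.
    lam = torch.sigmoid(torch.tensor(abs(gap) * 10.0)) - 0.5
    if gap > 0:    # RGB is stronger: teach the depth branch
        return lam * F.mse_loss(feat_depth, feat_rgb.detach())
    elif gap < 0:  # depth is stronger: teach the RGB branch
        return lam * F.mse_loss(feat_rgb, feat_depth.detach())
    return feat_rgb.new_zeros(())
```

Detaching the stronger branch's features keeps gradients from flowing into the teacher, so the transfer term only updates the weaker network.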