IEEE Trans Pattern Anal Mach Intell. 2016 Aug;38(8):1583-97. doi: 10.1109/TPAMI.2016.2537340. Epub 2016 Mar 2.
This paper describes a novel method called Deep Dynamic Neural Networks (DDNN) for multimodal gesture recognition. A semi-supervised hierarchical dynamic framework based on a Hidden Markov Model (HMM) is proposed for simultaneous gesture segmentation and recognition, in which skeleton joint information, depth images, and RGB images serve as the multimodal input observations. Unlike most traditional approaches that rely on the construction of complex handcrafted features, our approach learns high-level spatio-temporal representations using deep neural networks suited to the input modality: a Gaussian-Bernoulli Deep Belief Network (DBN) to handle skeletal dynamics, and a 3D Convolutional Neural Network (3DCNN) to manage and fuse batches of depth and RGB images. This is achieved by modeling and learning the emission probabilities of the HMM required to infer the gesture sequence. This purely data-driven approach achieves a Jaccard index score of 0.81 in the ChaLearn LAP gesture spotting challenge. The performance is on par with a variety of state-of-the-art hand-tuned feature-based approaches and other learning-based methods, thereby opening the door to the use of deep learning techniques for further exploring multimodal time series data.
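As a minimal sketch of the hybrid neural-network/HMM idea the abstract describes (not the authors' code): a deep network produces per-frame state posteriors p(s | x_t), which are rescaled by the state priors into emission scores p(x_t | s) ∝ p(s | x_t) / p(s) and decoded with the Viterbi algorithm to recover the gesture sequence. The network outputs below are random stand-ins for the DBN/3DCNN, and all names and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_prior):
    """Most likely state path given log emission scores (T x S),
    a log transition matrix (S x S), and a log initial-state prior (S,)."""
    T, S = log_emit.shape
    delta = log_prior + log_emit[0]            # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)         # backpointers for path recovery
    for t in range(1, T):
        cand = delta[:, None] + log_trans      # score of every s -> s' transition
        back[t] = cand.argmax(axis=0)          # best predecessor for each state
        delta = cand.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 3 HMM states, 5 frames.
rng = np.random.default_rng(0)
posteriors = rng.dirichlet(np.ones(3), size=5)        # stand-in for NN outputs p(s | x_t)
state_prior = posteriors.mean(axis=0)                 # p(s), normally estimated from training labels
log_emit = np.log(posteriors) - np.log(state_prior)   # scaled emissions p(x_t | s), up to a constant
log_trans = np.log(np.full((3, 3), 1 / 3))            # uniform transitions for the sketch
log_prior = np.log(np.full(3, 1 / 3))
print(viterbi(log_emit, log_trans, log_prior))
```

Dividing the posterior by the state prior is the standard hybrid-HMM scaling: it turns a discriminative per-frame classifier into generative-style emission scores that can be combined with the HMM's transition model during decoding.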