A Deep Learning Framework for Recognizing Both Static and Dynamic Gestures.

Affiliations

LIRMM, Université de Montpellier, CNRS, 34392 Montpellier, France.

Cognitive Robotics Department, Delft University of Technology, 2628 CD Delft, The Netherlands.

Publication information

Sensors (Basel). 2021 Mar 23;21(6):2227. doi: 10.3390/s21062227.

DOI:10.3390/s21062227

PMID:33806741

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8004797/

Abstract

Intuitive user interfaces are indispensable for interacting with human-centric smart environments. In this paper, we propose a unified framework that recognizes both static and dynamic gestures using simple RGB vision (without depth sensing). This feature makes it suitable for inexpensive human-robot interaction in social or industrial settings. We employ a pose-driven spatial attention strategy, which guides our proposed Static and Dynamic gestures Network. From the image of the human upper body, we estimate the person's depth, along with the regions of interest around their hands. The Convolutional Neural Network (CNN) in this network is fine-tuned on a background-substituted hand gestures dataset. It is used to detect 10 static gestures for each hand as well as to obtain the hand image embeddings. These are subsequently fused with the augmented pose vector and then passed to the stacked Long Short-Term Memory (LSTM) blocks. Thus, human-centred frame-wise information from the augmented pose vector and from the left and right hand image embeddings is aggregated in time to predict the dynamic gestures of the performing person. In a number of experiments, we show that the proposed approach surpasses the state-of-the-art results on a large-scale dataset. Moreover, we transfer the knowledge learned through the proposed methodology to a second dataset, and the obtained results also outscore the state of the art on that dataset.
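The per-frame fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the features are fused, so plain concatenation is assumed here, and the function names (`fuse_frame`, `build_sequence`), the dictionary keys, and the toy feature sizes are all hypothetical. The sketch only shows how an augmented pose vector and the two hand embeddings could be combined into one vector per frame, giving a time-ordered sequence that a recurrent model such as stacked LSTM blocks could consume.

```python
from typing import Dict, List, Sequence


def fuse_frame(pose: Sequence[float],
               left_hand: Sequence[float],
               right_hand: Sequence[float]) -> List[float]:
    """Fuse one frame's features by concatenation (an assumed fusion rule)."""
    return list(pose) + list(left_hand) + list(right_hand)


def build_sequence(frames: Sequence[Dict[str, Sequence[float]]]) -> List[List[float]]:
    """Turn per-frame features into a time-ordered list of fused vectors."""
    fused = [fuse_frame(f["pose"], f["left"], f["right"]) for f in frames]
    # A recurrent model needs the same feature width at every time step.
    if len({len(vector) for vector in fused}) > 1:
        raise ValueError("all frames must produce the same feature width")
    return fused


# Toy clip: 3 frames, a 4-D pose vector and a 2-D embedding per hand.
# These sizes are made up for illustration only.
clip = [
    {"pose": [0.1 * t] * 4, "left": [1.0, float(t)], "right": [2.0, float(t)]}
    for t in range(3)
]
sequence = build_sequence(clip)
print(len(sequence), len(sequence[0]))
```

In a real pipeline the hand embeddings would come from the fine-tuned CNN and the fused sequence would be passed to the temporal model; both are outside the scope of this sketch.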
