
Real-Time Monocular Skeleton-Based Hand Gesture Recognition Using 3D-Jointsformer.

Affiliation

Grupo de Tratamiento de Imágenes (GTI), Information Processing and Telecommunications Center, ETSI Telecomunicación, Universidad Politécnica de Madrid, 28040 Madrid, Spain.

Publication information

Sensors (Basel). 2023 Aug 10;23(16):7066. doi: 10.3390/s23167066.

Abstract

Automatic hand gesture recognition in video sequences has widespread applications, ranging from home automation to sign language interpretation and clinical operations. The primary challenge lies in achieving real-time recognition while handling the temporal dependencies that can affect performance. Existing methods employ either 3D convolutional or Transformer-based architectures with hand skeleton estimation, but both have limitations. To address these challenges, a hybrid approach that combines 3D Convolutional Neural Networks (3D-CNNs) and Transformers is proposed. The method uses a 3D-CNN to compute high-level semantic skeleton embeddings, capturing local spatial and temporal characteristics of hand gestures. A Transformer network with a self-attention mechanism is then employed to efficiently capture long-range temporal dependencies in the skeleton sequence. Evaluation on the Briareo and Multimodal Hand Gesture datasets yielded accuracies of 95.49% and 97.25%, respectively. Notably, the approach achieves real-time performance on a standard CPU, distinguishing it from methods that require specialized GPUs. This combination of real-time efficiency and high accuracy demonstrates its superiority over existing state-of-the-art methods. In summary, the hybrid 3D-CNN and Transformer approach effectively addresses both real-time recognition and the efficient handling of temporal dependencies, outperforming existing methods in accuracy and speed.
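
The self-attention step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the per-frame skeleton embeddings have already been computed (the 3D-CNN stage is omitted), uses a single attention head, and takes queries, keys and values to be the embeddings themselves rather than learned projections. The function names and the toy values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention over frame embeddings.

    frames: list of T embeddings, each a list of d floats.
    Assumption: queries, keys and values are the embeddings themselves
    (identity projections); a trained Transformer learns separate ones.
    Returns (outputs, weights): outputs is T x d, weights is T x T.
    """
    d = len(frames[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in frames:
        # Similarity of this frame to every frame in the sequence.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in frames]
        w = softmax(scores)
        # Output is the attention-weighted average of all frame embeddings.
        out = [sum(w[t] * frames[t][j] for t in range(len(frames)))
               for j in range(d)]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Toy sequence: 4 frames with 3-dimensional embeddings (made-up values).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
outputs, weights = self_attention(frames)
```

In this toy input, frames 0 and 2 carry the same embedding, so frame 0 attends to frame 2 as strongly as to itself even though they are not adjacent. That is the sense in which attention can relate distant time steps without a fixed receptive field.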


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fc53/10459010/52d3a13b5f2f/sensors-23-07066-g001.jpg
