Aly Mohammed, Fathi Islam S
Department of Artificial Intelligence, Faculty of Artificial Intelligence, Egyptian Russian University, Badr City, 11829, Egypt.
Department of Computer Science, Faculty of Information Technology, Ajloun National University, P. O. 43, Ajloun, 26810, Jordan.
Sci Rep. 2025 Jun 23;15(1):20253. doi: 10.1038/s41598-025-06344-8.
Gesture recognition plays a vital role in computer vision, especially for interpreting sign language and enabling human-computer interaction. Many existing methods struggle with heavy computational demands, difficulty modeling long-range relationships, sensitivity to background noise, and poor performance across varied environments. While convolutional neural networks (CNNs) excel at capturing local details, they often miss the bigger picture. Vision Transformers, on the other hand, are better at modeling global context but usually require significantly more computational resources, limiting their use in real-time systems. To tackle these issues, we propose a Hybrid Transformer-CNN model that combines the strengths of both architectures. Our approach begins with CNN layers that extract detailed local features from both the overall hand and specific hand regions. These CNN features are then refined by a Vision Transformer module, which captures long-range dependencies and global contextual information within the gesture. This integration allows the model to recognize subtle hand movements effectively while maintaining computational efficiency. Tested on the ASL Alphabet dataset, our model achieves a high accuracy of 99.97%, runs at 110 frames per second, and requires only 5.0 GFLOPs, less than half the computation of traditional Vision Transformer models. Central to this success is our feature fusion strategy based on element-wise multiplication, which helps the model focus on important gesture details while suppressing background noise. Additionally, we employ advanced data augmentation techniques and a training approach incorporating contrastive learning and domain adaptation to boost robustness. Overall, this work offers a practical and powerful solution for gesture recognition, striking an optimal balance between accuracy, speed, and efficiency, and marking an important step toward real-world applications.
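The fusion strategy described in the abstract can be illustrated with a minimal sketch. This is a hypothetical, framework-free illustration (the paper does not publish this code): element-wise multiplication acts as a gating operation, so a fused channel stays strong only when both the CNN (local) and Transformer (global) branches respond, which is how multiplicative fusion suppresses background activations that appear in just one branch.

```python
# Hypothetical sketch of element-wise multiplicative feature fusion.
# Each branch is represented here as a plain list of channel activations;
# in the actual model these would be CNN and Vision Transformer feature maps.

def fuse_multiplicative(cnn_feats, vit_feats):
    """Element-wise product of two equally sized feature vectors.

    Channels where either branch is near zero are suppressed,
    so only features supported by both local and global evidence survive.
    """
    if len(cnn_feats) != len(vit_feats):
        raise ValueError("feature vectors must have the same length")
    return [c * v for c, v in zip(cnn_feats, vit_feats)]

# Illustrative values: channel 2 has a strong local (CNN) response,
# but the global (ViT) branch treats it as background, so fusion mutes it.
cnn = [0.9, 0.1, 0.8]
vit = [0.7, 0.9, 0.05]
fused = [round(x, 2) for x in fuse_multiplicative(cnn, vit)]
print(fused)  # [0.63, 0.09, 0.04]
```

Compared with additive fusion (where one strong branch can dominate), the multiplicative form behaves like a soft AND over the two feature streams, which matches the abstract's claim that fusion helps suppress background noise.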