Al-Tawil Basheer, Jung Magnus, Hempel Thorsten, Al-Hamadi Ayoub
Neuro-Information Technology, Otto-von-Guericke-University Magdeburg, 39106 Magdeburg, Germany.
Sensors (Basel). 2025 May 6;25(9):2930. doi: 10.3390/s25092930.
Human action recognition (HAR) is essential for understanding and classifying human movements and is widely used in real-life applications such as human-computer interaction and assistive robotics. However, recognizing patterns across different temporal scales remains challenging: traditional methods struggle with complex timing patterns, intra-class variability, and inter-class similarities, leading to misclassifications. In this paper, we propose a deep learning framework for efficient and robust HAR. It integrates a residual network (ResNet-18) for spatial feature extraction with a Bi-LSTM for temporal feature extraction, while a multi-head attention mechanism prioritizes crucial motion details. Additionally, we introduce a motion-based frame selection strategy that uses optical flow to reduce redundancy and improve efficiency, enabling accurate, real-time recognition of both simple and complex actions. We evaluate the framework on the UCF-101 dataset, achieving 96.60% accuracy, competitive with state-of-the-art approaches. Moreover, the framework operates at 222 frames per second (FPS), striking a strong balance between recognition performance and computational efficiency. It was also deployed and tested on the TIAGo mobile service robot, validating its real-time applicability in real-world scenarios. The framework effectively models human actions while minimizing frame dependency, making it well suited for real-time applications.
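To make the described pipeline concrete, the following is a minimal PyTorch sketch of one plausible realization: per-frame ResNet-18 features, a Bi-LSTM over the frame sequence, multi-head self-attention, and a classification head. All hyperparameters here (hidden size, number of heads, mean pooling over time) are illustrative assumptions, not the paper's reported configuration.

```python
# Hedged sketch of the described HAR pipeline: ResNet-18 per-frame spatial
# features -> Bi-LSTM temporal modeling -> multi-head attention -> classifier.
# Layer sizes and pooling choices are assumptions, not the paper's settings.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class HARModel(nn.Module):
    def __init__(self, num_classes=101, hidden_dim=256, num_heads=4):
        super().__init__()
        backbone = resnet18(weights=None)
        # Drop the ImageNet classification head; keep the 512-d pooled features.
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])
        self.bilstm = nn.LSTM(512, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden_dim,
                                          num_heads=num_heads,
                                          batch_first=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, clips):  # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).flatten(1)  # (B*T, 512)
        feats = feats.view(b, t, -1)                      # (B, T, 512)
        seq, _ = self.bilstm(feats)                       # (B, T, 2*hidden)
        attended, _ = self.attn(seq, seq, seq)            # temporal self-attention
        return self.fc(attended.mean(dim=1))              # (B, num_classes)

# Usage: logits for a batch of 2 clips of 16 frames at 224x224.
model = HARModel()
logits = model(torch.randn(2, 16, 3, 224, 224))
```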
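The motion-based frame selection could look like the OpenCV sketch below, which uses Farneback dense optical flow to keep only frames with sufficient motion relative to the last kept frame. The thresholding rule and flow parameters are illustrative guesses, not the authors' exact selection criterion.

```python
# Hedged sketch of motion-based frame selection via dense optical flow.
# The mean-magnitude threshold is an assumption for illustration only.
import cv2
import numpy as np

def select_frames(frames, motion_thresh=0.5):
    """Keep frames whose mean optical-flow magnitude relative to the
    previously kept frame exceeds motion_thresh; always keep frame 0."""
    kept = [frames[0]]
    prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2).mean()  # mean per-pixel motion
        if mag > motion_thresh:  # enough motion: keep and advance reference
            kept.append(frame)
            prev_gray = gray
    return kept
```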