Elnady Mahmoud, Abdelmunim Hossam E
Computer and Systems Engineering, Ain Shams University, El Sarayat, Cairo, 11517, Egypt.
Sci Rep. 2025 May 16;15(1):17036. doi: 10.1038/s41598-025-01898-z.
Human Action Recognition (HAR) is a critical task in computer vision with applications in surveillance, healthcare, and human-computer interaction. This paper introduces a novel approach combining the strengths of You Only Look Once (YOLO) for feature extraction and Long Short-Term Memory (LSTM) networks for temporal modeling to achieve robust and accurate action recognition in video sequences. The YOLO model efficiently identifies key features from individual frames, enabling real-time processing, while the LSTM network captures temporal dependencies to understand sequential dynamics in human movements. The proposed YOLO-LSTM framework is evaluated on multiple publicly available HAR datasets, achieving an accuracy of 96%, precision of 96%, recall of 97%, and F1-score of 96% on the UCF101 dataset; 99% across all metrics on the KTH dataset; 100% on the WEIZMANN dataset; and 98% on the IXMAS dataset. These results demonstrate the superior performance of our approach compared to existing methods in terms of both accuracy and processing speed. Additionally, this approach effectively handles challenges such as occlusions, varying illumination, and complex backgrounds, making it suitable for real-world applications. The results highlight the potential of combining object detection and recurrent architectures for advancing state-of-the-art HAR systems.
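The pipeline the abstract describes — per-frame feature extraction followed by recurrent temporal modeling and a final classification — can be sketched minimally. The paper's actual code is not reproduced here: the hand-rolled NumPy LSTM cell, the random weights, and the random "frame features" below are illustrative stand-ins for the trained YOLO backbone and LSTM head, and all dimensions (`D`, `H`, `C`) are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: gates computed from frame feature x and state (h, c)."""
    H = h.shape[0]
    z = W @ x + U @ h + b          # stacked pre-activations for the 4 gates
    i = sigmoid(z[:H])             # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    g = np.tanh(z[2 * H:3 * H])    # candidate cell state
    o = sigmoid(z[3 * H:])         # output gate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

D, H, C = 32, 16, 5  # feature dim, hidden dim, number of action classes (illustrative)
W = rng.normal(0, 0.1, (4 * H, D))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)
Wc = rng.normal(0, 0.1, (C, H))  # classification head over the final hidden state

# Stand-in for per-frame YOLO features: T frames, each a D-dim vector.
frames = rng.normal(size=(20, D))

h, c = np.zeros(H), np.zeros(H)
for x in frames:                 # LSTM consumes the frame sequence in order
    h, c = lstm_step(x, h, c, W, U, b)

logits = Wc @ h                  # action scores from the last hidden state
pred = int(np.argmax(logits))
print(pred)
```

In a real system, `frames` would come from a YOLO model run on each video frame (e.g. pooled backbone features or detection embeddings), and the weights would be learned end-to-end rather than sampled.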