Hayat Ullah, Arslan Munir
Department of Computer Science, Kansas State University, Manhattan, KS 66506, USA.
J Imaging. 2023 Jun 26;9(7):130. doi: 10.3390/jimaging9070130.
Vision-based human activity recognition (HAR) has emerged as one of the essential research areas in video analytics. Over the last decade, numerous advanced deep learning algorithms have been introduced to recognize complex human actions from video streams. These deep learning algorithms have shown impressive performance on video analytics tasks. However, these newly introduced methods focus exclusively on either model performance or computational efficiency, resulting in a biased trade-off between robustness and efficiency in their proposed solutions to the challenging HAR problem. To enhance both accuracy and computational efficiency, this paper presents a computationally efficient yet generic spatial-temporal cascaded framework that exploits deep discriminative spatial and temporal features for HAR. For efficient representation of human actions, we propose a dual attentional convolutional neural network (DA-CNN) architecture that leverages a unified channel-spatial attention mechanism to extract human-centric salient features from video frames. The dual channel-spatial attention layers, together with the convolutional layers, learn to be more selective toward the spatial receptive fields that contain objects within the feature maps. The extracted discriminative salient features are then forwarded to a stacked bi-directional gated recurrent unit (Bi-GRU) for long-term temporal modeling and recognition of human actions using both forward- and backward-pass gradient learning. Extensive experiments are conducted on three publicly available human action datasets, where the obtained results verify the effectiveness of our proposed framework (DA-CNN+Bi-GRU) over state-of-the-art methods in terms of model accuracy and inference runtime across each dataset. Experimental results show that the DA-CNN+Bi-GRU framework attains an improvement in execution time of up to 167× in terms of frames per second compared with most contemporary action-recognition methods.
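To make the described pipeline concrete, the following is a minimal PyTorch sketch of the abstract's two components: a per-frame CNN with unified channel-spatial attention (here realized as a CBAM-style channel-then-spatial module, an assumed design) feeding a stacked Bi-GRU for temporal modeling. All layer widths, the backbone depth, the reduction ratio, and the class names (`ChannelSpatialAttention`, `DACNNBiGRU`) are illustrative assumptions, not the paper's exact architecture.

```python
# A hedged sketch of a DA-CNN+Bi-GRU-style pipeline; layer sizes and the
# attention design are assumptions for illustration, not the authors' code.
import torch
import torch.nn as nn


class ChannelSpatialAttention(nn.Module):
    """CBAM-style channel attention followed by spatial attention (assumed)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, re-weight each channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: 7x7 conv over per-pixel channel statistics.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)  # channel re-weighting
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * self.spatial_conv(pooled)  # spatial re-weighting


class DACNNBiGRU(nn.Module):
    """Per-frame attentional CNN features fed to a stacked Bi-GRU classifier."""

    def __init__(self, num_classes: int, feat_dim: int = 128, hidden: int = 256):
        super().__init__()
        # Toy two-stage backbone with attention after each conv stage.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            ChannelSpatialAttention(64),
            nn.Conv2d(64, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            ChannelSpatialAttention(feat_dim),
            nn.AdaptiveAvgPool2d(1),  # -> (B*T, feat_dim, 1, 1)
        )
        # Two stacked bi-directional GRU layers for long-term temporal modeling.
        self.bigru = nn.GRU(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 3, H, W) stacked video frames
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1)).flatten(1)  # (B*T, feat_dim)
        seq, _ = self.bigru(feats.view(b, t, -1))              # (B, T, 2*hidden)
        return self.classifier(seq[:, -1])                     # last-step logits


# Usage example: score a batch of 2 clips of 16 frames at 112x112 resolution.
if __name__ == "__main__":
    model = DACNNBiGRU(num_classes=101)
    logits = model(torch.randn(2, 16, 3, 112, 112))
    print(logits.shape)  # torch.Size([2, 101])
```

Processing each frame independently through the attention CNN and deferring all temporal reasoning to a lightweight recurrent head is one plausible reading of why such a cascade can be fast at inference time: the per-frame backbone dominates the cost and parallelizes over frames, while the Bi-GRU adds only a small sequential pass.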