Ouyang Yuhao, Li Xiangqian
School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China.
Entropy (Basel). 2025 Mar 31;27(4):368. doi: 10.3390/e27040368.
This study proposes a three-dimensional (3D) residual attention network (3DRFNet) for human activity recognition that learns spatiotemporal representations from video. The core innovation is the integration of attention mechanisms into the 3D ResNet framework to emphasize key features and suppress irrelevant ones. In each 3D ResNet block, channel and spatial attention mechanisms generate attention maps for tensor segments, which are then multiplied with the input feature map to emphasize key features. In addition, the integration of Fast Fourier Convolution (FFC) enhances the network's ability to capture temporal and spatial features effectively. A cross-entropy loss function measures the difference between the predicted values and the ground truth to guide the model's backpropagation. Experimental results demonstrate that 3DRFNet achieves state-of-the-art (SOTA) performance in human action recognition, with accuracies of 91.7% and 98.7% on the HMDB-51 and UCF-101 datasets, respectively. These results highlight 3DRFNet's advantages in recognition accuracy and robustness, particularly its ability to capture key behavioral features in videos through the two attention mechanisms.
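The abstract describes channel and spatial attention maps that are multiplied with the block's feature map inside each 3D residual block. A minimal NumPy sketch of that pattern is shown below; the paper's actual attention modules presumably use learned MLP/convolution weights, so the plain pooling-plus-sigmoid gates here are a simplifying assumption, and `attention_block` is a hypothetical name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x):
    # x: (C, T, H, W) feature tensor from a 3D ResNet block.
    # Global average pooling over the spatiotemporal dims gives one
    # descriptor per channel; a sigmoid turns it into a gate in (0, 1).
    # (A real module would pass the descriptor through a learned MLP;
    # that is omitted here as a simplifying assumption.)
    desc = x.mean(axis=(1, 2, 3))            # (C,)
    gate = sigmoid(desc)                     # (C,)
    return x * gate[:, None, None, None]     # rescale each channel

def spatial_attention(x):
    # Pool across channels to get a (T, H, W) saliency map, gate it,
    # and broadcast the map back over all channels.
    desc = x.mean(axis=0)                    # (T, H, W)
    gate = sigmoid(desc)
    return x * gate[None, ...]

def attention_block(x):
    # Attention-modulated features added back through a residual
    # connection, matching the "multiply attention maps with the input
    # feature map" description in the abstract.
    out = spatial_attention(channel_attention(x))
    return x + out
```

Because both gates lie in (0, 1), the attended branch can only rescale (never amplify beyond, or flip the sign of) the original activations before the residual addition.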
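The Fast Fourier Convolution mentioned in the abstract operates partly in the frequency domain: features are transformed with an FFT, mixed pointwise in the spectral domain, and transformed back. A minimal sketch of that spectral path is given below; the actual FFC applies learned convolutions to stacked real/imaginary parts, so the per-channel scalar weights `w` used here are a simplifying assumption.

```python
import numpy as np

def spectral_transform(x, w):
    # x: (C, T, H, W) feature tensor; w: (C,) per-channel spectral weights.
    # FFC's global branch in miniature: FFT over the spatiotemporal
    # dimensions, pointwise mixing in the frequency domain, inverse FFT.
    X = np.fft.rfftn(x, axes=(1, 2, 3))          # to frequency domain
    X = X * w[:, None, None, None]               # pointwise spectral mixing
    return np.fft.irfftn(X, s=x.shape[1:], axes=(1, 2, 3))
```

Because multiplication in the frequency domain corresponds to circular convolution in the signal domain, a single pointwise operation here gives every output position a receptive field spanning the whole clip, which is why FFC helps the network capture global temporal and spatial structure.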