Weng Zhengkui, Li Xinmin, Xiong Shoujian
School of Automation, Qingdao University, Qingdao, 266071, China.
School of Internet, Jiaxing Vocational and Technical College, Jiaxing, 314036, China.
Sci Rep. 2024 Oct 31;14(1):26202. doi: 10.1038/s41598-024-75640-6.
In the field of human action recognition, effectively characterizing video-level spatio-temporal features is a long-standing challenge. This is attributable in part to the inability of CNNs to model long-range temporal information, especially for actions that consist of multiple staged behaviors. In this paper, a novel attention-based spatio-temporal VLAD network (AST-VLAD) with a self-attention model is developed to aggregate informative deep features across the video according to adaptively selected deep features. Moreover, a fully automatic approach to adaptive video sequence optimization (AVSO) is proposed based on shot segmentation and dynamic weighted sampling; AVSO increases the proportion of action-related frames and eliminates redundant intervals. Then, based on the optimized video, a self-attention model is introduced in AST-VLAD to model the intrinsic spatio-temporal relationships of deep features, rather than aggregating frame-level features by average or max pooling. Extensive experiments are conducted on two public benchmarks, HMDB51 and UCF101. Compared with existing frameworks, the results show that the proposed approach achieves better or comparable classification accuracy on both HMDB51 (73.1%) and UCF101 (96.0%).
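To make the aggregation step concrete, the following is a minimal PyTorch sketch of self-attention-based VLAD pooling over frame-level features, in the spirit of the abstract's description. The single attention head, layer sizes, NetVLAD-style soft assignment, and normalization scheme are all illustrative assumptions; the paper's actual AST-VLAD configuration (and the AVSO sampling that precedes it) may differ.

```python
# Sketch: self-attention over sampled frames followed by VLAD-style
# aggregation, replacing average/max pooling of frame-level features.
# All hyperparameters here are assumptions, not the paper's settings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionVLAD(nn.Module):
    def __init__(self, feat_dim=512, num_clusters=32):
        super().__init__()
        # Self-attention along the temporal axis (assumption: one head).
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=1, batch_first=True)
        # NetVLAD-style learnable cluster centers and soft-assignment layer.
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim) * 0.01)
        self.assign = nn.Linear(feat_dim, num_clusters)

    def forward(self, x):
        # x: (B, T, D) frame-level CNN features for T sampled frames.
        # Attention models pairwise temporal relations among frames.
        h, _ = self.attn(x, x, x)                      # (B, T, D)
        a = F.softmax(self.assign(h), dim=-1)          # (B, T, K) soft assignment
        # Residual of each frame feature to each cluster center.
        resid = h.unsqueeze(2) - self.centers          # (B, T, K, D)
        vlad = (a.unsqueeze(-1) * resid).sum(dim=1)    # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)               # intra-normalization
        return F.normalize(vlad.flatten(1), dim=-1)    # (B, K*D) video descriptor


# Usage: aggregate 16 sampled frames of 512-d features into one descriptor.
feats = torch.randn(2, 16, 512)
print(AttentionVLAD()(feats).shape)  # torch.Size([2, 16384])
```

The resulting fixed-length descriptor would then feed a linear classifier for action recognition; the key design point is that the attention output, not a plain temporal mean, weights each frame's contribution to the video-level representation.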