
Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Authors

Weng Zhengkui, Li Xinmin, Xiong Shoujian

Affiliations

School of Automation, Qingdao University, Qingdao, 266071, China.

School of Internet, Jiaxing Vocational and Technical College, Jiaxing, 314036, China.

Publication

Sci Rep. 2024 Oct 31;14(1):26202. doi: 10.1038/s41598-024-75640-6.

DOI: 10.1038/s41598-024-75640-6
PMID: 39482337
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11527889/
Abstract

In the field of human action recognition, effectively characterizing video-level spatio-temporal features is a long-standing challenge. This is attributable in part to the inability of CNNs to model long-range temporal information, especially for actions that consist of multiple staged behaviors. In this paper, a novel attention-based spatio-temporal VLAD network (AST-VLAD) with a self-attention model is developed to aggregate informative deep features across the video according to the adaptively selected deep features. Moreover, a fully automatic approach to adaptive video sequences optimization (AVSO) is proposed through shot segmentation and dynamic weighted sampling; AVSO increases the proportion of action-related frames and eliminates redundant intervals. Then, based on the optimized video, a self-attention model is introduced into AST-VLAD to model the intrinsic spatio-temporal relationships of deep features, instead of pooling frame-level features by averaging or max pooling. Extensive experiments are conducted on two public benchmarks, HMDB51 and UCF101, for evaluation. Compared with existing frameworks, results show that the proposed approach performs better than or on par with them in classification accuracy on both HMDB51 (73.1%) and UCF101 (96.0%).
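The core idea of the abstract — replacing average/max pooling of frame-level features with self-attention weighting before VLAD-style residual aggregation — can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' AST-VLAD implementation: the function name, the single-head attention, and all dimensions are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pooled_vlad(frames, centers, w_q, w_k):
    """Aggregate frame-level deep features into one video-level descriptor.

    frames:  (T, D) deep features, one row per sampled frame
    centers: (K, D) learned cluster centers (the VLAD codebook)
    w_q, w_k: (D, D) projections for a single-head self-attention
    Returns a flattened, L2-normalized (K*D,) descriptor.
    """
    T, D = frames.shape
    K = centers.shape[0]

    # Self-attention across frames: emphasize informative frames instead
    # of averaging or max-pooling frame-level features.
    q, k = frames @ w_q, frames @ w_k
    attn = softmax(q @ k.T / np.sqrt(D), axis=-1)   # (T, T)
    attended = attn @ frames                        # (T, D)

    # Soft-assign each attended feature to the K cluster centers.
    dists = np.linalg.norm(attended[:, None, :] - centers[None, :, :], axis=-1)
    assign = softmax(-dists, axis=-1)               # (T, K)

    # Accumulate residuals (feature - center) per cluster, as in VLAD.
    residuals = attended[:, None, :] - centers[None, :, :]  # (T, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)     # (K, D)

    # Intra-normalize per cluster, then flatten and L2-normalize.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    out = vlad.ravel()
    return out / (np.linalg.norm(out) + 1e-12)
```

In a trainable version the codebook and projections would be learned end-to-end (as in NetVLAD, cited below); here they are fixed arrays so the aggregation step itself is easy to inspect.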


Figures (PMC11527889):
Fig1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55ed/11527889/657636cef35a/41598_2024_75640_Fig1_HTML.jpg
Fig2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55ed/11527889/140dad860a5d/41598_2024_75640_Fig2_HTML.jpg
Fig3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55ed/11527889/140b2e2deba7/41598_2024_75640_Fig3_HTML.jpg
Fig4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55ed/11527889/88a211721b08/41598_2024_75640_Fig4_HTML.jpg
Fig5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55ed/11527889/e7c3fc4f7f06/41598_2024_75640_Fig5_HTML.jpg
Fig6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55ed/11527889/2db44e1c72d2/41598_2024_75640_Fig6_HTML.jpg
Fig7: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55ed/11527889/726520b90912/41598_2024_75640_Fig7_HTML.jpg
Figa: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/55ed/11527889/5057e3e42759/41598_2024_75640_Figa_HTML.jpg

Similar Articles

1. Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization. Sci Rep. 2024 Oct 31;14(1):26202. doi: 10.1038/s41598-024-75640-6.
2. Action-Stage Emphasized Spatio-Temporal VLAD for Video Action Recognition. IEEE Trans Image Process. 2019 Jan 3. doi: 10.1109/TIP.2018.2890749.
3. Action Recognition Using Action Sequences Optimization and Two-Stream 3D Dilated Neural Network. Comput Intell Neurosci. 2022 Jun 13;2022:6608448. doi: 10.1155/2022/6608448. eCollection 2022.
4. MEST: An Action Recognition Network with Motion Encoder and Spatio-Temporal Module. Sensors (Basel). 2022 Sep 1;22(17):6595. doi: 10.3390/s22176595.
5. Sequential Video VLAD: Training the Aggregation Locally and Temporally. IEEE Trans Image Process. 2018 Oct;27(10):4933-4944. doi: 10.1109/TIP.2018.2846664.
6. Interaction-Aware Spatio-Temporal Pyramid Attention Networks for Action Classification. IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):7010-7028. doi: 10.1109/TPAMI.2021.3100277. Epub 2022 Sep 14.
7. STA-TSN: Spatial-Temporal Attention Temporal Segment Network for action recognition in video. PLoS One. 2022 Mar 17;17(3):e0265115. doi: 10.1371/journal.pone.0265115. eCollection 2022.
8. Histogram of Oriented Gradient-Based Fusion of Features for Human Action Recognition in Action Video Sequences. Sensors (Basel). 2020 Dec 18;20(24):7299. doi: 10.3390/s20247299.
9. AMS-Net: Modeling Adaptive Multi-Granularity Spatio-Temporal Cues for Video Action Recognition. IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):18731-18745. doi: 10.1109/TNNLS.2023.3321141. Epub 2024 Dec 2.
10. Long-Term Temporal Convolutions for Action Recognition. IEEE Trans Pattern Anal Mach Intell. 2018 Jun;40(6):1510-1517. doi: 10.1109/TPAMI.2017.2712608. Epub 2017 Jun 6.

Cited By

1. Learning behavior aware features across spaces for improved 3D human motion prediction. Sci Rep. 2025 Aug 4;15(1):28355. doi: 10.1038/s41598-025-11073-z.
2. AI-driven video summarization for optimizing content retrieval and management through deep learning techniques. Sci Rep. 2025 Feb 3;15(1):4058. doi: 10.1038/s41598-025-87824-9.

References

1. Rethinking Attentive Object Detection via Neural Attention Learning. IEEE Trans Image Process. 2024;33:1726-1739. doi: 10.1109/TIP.2023.3251693. Epub 2024 Mar 7.
2. ASNet: Auto-Augmented Siamese Neural Network for Action Recognition. Sensors (Basel). 2021 Jul 10;21(14):4720. doi: 10.3390/s21144720.
3. Action-Stage Emphasized Spatio-Temporal VLAD for Video Action Recognition. IEEE Trans Image Process. 2019 Jan 3. doi: 10.1109/TIP.2018.2890749.
4. Sequential Video VLAD: Training the Aggregation Locally and Temporally. IEEE Trans Image Process. 2018 Oct;27(10):4933-4944. doi: 10.1109/TIP.2018.2846664.
5. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. IEEE Trans Pattern Anal Mach Intell. 2018 Jun;40(6):1437-1451. doi: 10.1109/TPAMI.2017.2711011. Epub 2017 Jun 1.
6. Long-Term Temporal Convolutions for Action Recognition. IEEE Trans Pattern Anal Mach Intell. 2018 Jun;40(6):1510-1517. doi: 10.1109/TPAMI.2017.2712608. Epub 2017 Jun 6.