Bian Cunling, Yang Yang, Wang Tao, Lu Weigang
Department of Education, Ocean University of China, Qingdao 266100, China.
Department of Campus Security, Ocean University of China, Qingdao 266100, China.
Sensors (Basel). 2025 May 16;25(10):3146. doi: 10.3390/s25103146.
Skeleton representation learning offers substantial advantages for action recognition by encoding intricate motion details and spatial-temporal dependencies among joints. However, fully supervised approaches necessitate large amounts of annotated data, which are often labor-intensive and costly to acquire. In this work, we propose the Spatial-Temporal Heatmap Masked Autoencoder (STH-MAE), a novel self-supervised framework tailored for skeleton-based action recognition. Unlike coordinate-based methods, STH-MAE adopts heatmap volumes as its primary representation, mitigating noise inherent in pose estimation while capitalizing on advances in Vision Transformers. The framework constructs a spatial-temporal heatmap (STH) by aggregating 2D joint heatmaps across both spatial and temporal axes. This STH is partitioned into non-overlapping patches to facilitate local feature learning, with a masking strategy applied to randomly conceal portions of the input. During pre-training, a Vision Transformer-based autoencoder equipped with a lightweight prediction head reconstructs the masked regions, fostering the extraction of robust and transferable skeletal representations. Comprehensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 benchmarks demonstrate the superiority of STH-MAE, achieving state-of-the-art performance under multiple evaluation protocols.
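The pipeline described in the abstract (joint heatmaps, aggregation into a spatial-temporal volume, non-overlapping patches, random masking, reconstruction of masked regions) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not specify the aggregation rule, patch size, mask ratio, or loss, so the following are all hypothetical choices: Gaussian joint heatmaps, a per-frame maximum over joints, frames stacked along time, square 2D patches, a 0.75 mask ratio, and a mean-squared error restricted to masked patches. The function names are likewise illustrative, and the Vision Transformer encoder and prediction head are omitted.

```python
import numpy as np


def joint_heatmaps(joints, height, width, sigma=2.0):
    """Render one Gaussian heatmap per joint for a single frame.

    joints: array of shape (J, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (J, height, width) with values in (0, 1].
    """
    ys, xs = np.mgrid[0:height, 0:width]
    dx = xs[None, :, :] - joints[:, 0, None, None]
    dy = ys[None, :, :] - joints[:, 1, None, None]
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))


def build_sth(sequence, height, width, sigma=2.0):
    """Aggregate joint heatmaps into a (T, H, W) volume.

    Assumption: joints are merged per frame with a max, and frames are
    stacked along the time axis. The paper's exact rule may differ.
    """
    frames = [joint_heatmaps(f, height, width, sigma).max(axis=0)
              for f in sequence]
    return np.stack(frames, axis=0)


def patchify(volume, patch):
    """Split each frame into non-overlapping patch x patch tiles.

    Returns an array of shape (T * H/patch * W/patch, patch * patch).
    """
    t, h, w = volume.shape
    assert h % patch == 0 and w % patch == 0, "frame size must divide evenly"
    gh, gw = h // patch, w // patch
    x = volume.reshape(t, gh, patch, gw, patch)
    x = x.transpose(0, 1, 3, 2, 4)  # (t, gh, gw, patch, patch)
    return x.reshape(t * gh * gw, patch * patch)


def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask


def masked_mse(pred, target, mask):
    """Mean-squared reconstruction error computed on masked patches only."""
    return float(((pred - target) ** 2)[mask].mean())


# Toy usage: 4 frames, 17 joints, 32 x 32 heatmaps, 8 x 8 patches.
rng = np.random.default_rng(0)
seq = rng.uniform(0, 32, size=(4, 17, 2))
sth = build_sth(seq, 32, 32)
patches = patchify(sth, 8)
mask = random_mask(len(patches), 0.75, rng)
visible = patches[~mask]  # the tokens an encoder would actually see
```

In a full model, `visible` would be embedded and passed to the encoder, and the prediction head's output for the hidden positions would be compared against `patches[mask]` using a loss such as `masked_mse`.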