Department of Control Science and Engineering, Harbin Institute of Technology, Harbin 150001, China.
Sensors (Basel). 2024 Feb 11;24(4):1189. doi: 10.3390/s24041189.
This paper introduces 'SkeletonCLIP++', an extension of our prior work in human action recognition that emphasizes the use of semantic information beyond traditional label-based methods. The first innovation, 'Weighted Frame Integration' (WFI), shifts video feature computation from simple averaging to a weighted combination of frames, enabling a more nuanced representation of human movements aligned with semantic relevance. Another key development, 'Contrastive Sample Identification' (CSI), introduces a novel discriminative task into the model: identifying the most similar negative sample among the positives, which enhances the model's ability to distinguish between closely related actions. Finally, 'BERT Text Encoder Integration' (BTEI) leverages the pre-trained BERT model as our text encoder to further refine the model's performance. Empirical evaluations on the HMDB-51, UCF-101, and NTU RGB+D 60 datasets show consistent improvements, particularly on the smaller datasets. 'SkeletonCLIP++' thus offers a refined approach to human action recognition, preserving semantic integrity and enabling fine-grained differentiation in video data analysis.
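The two core mechanisms described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names, the cosine-similarity weighting, the softmax temperature `tau`, and the feature dimensions are all hypothetical.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_frame_integration(frame_feats, text_feat, tau=0.1):
    """WFI sketch: instead of averaging frame features, weight each
    frame by its similarity to the text (label) embedding, so that
    semantically relevant frames dominate the video representation."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    sims = f @ t                    # (T,) frame-text cosine similarities
    weights = softmax(sims / tau)   # (T,) weights summing to 1
    return weights @ frame_feats    # (d,) weighted video feature

def hardest_negative(video_feat, neg_feats):
    """CSI sketch: among the negative text embeddings, identify the one
    most similar to the video feature -- the sample the discriminative
    task must learn to tell apart from the true positive."""
    v = video_feat / np.linalg.norm(video_feat)
    n = neg_feats / np.linalg.norm(neg_feats, axis=1, keepdims=True)
    return int(np.argmax(n @ v))    # index of the hardest negative

rng = np.random.default_rng(0)
frames = rng.normal(size=(16, 512))  # 16 frames, 512-dim features
text = rng.normal(size=512)          # label embedding (e.g., from BERT)
video_feat = weighted_frame_integration(frames, text)
```

In this reading, WFI replaces the uniform mean over frames with a similarity-driven convex combination, and CSI supplies the hardest negative as the target of an auxiliary classification loss.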