Action Quality Assessment via Hierarchical Pose-Guided Multi-Stage Contrastive Regression.

Author information

Qi Mengshi, Ye Hao, Peng Jiaxuan, Ma Huadong

Publication information

IEEE Trans Image Process. 2025;34:6461-6474. doi: 10.1109/TIP.2025.3613952.

Abstract

Action Quality Assessment (AQA), which aims at the automatic and fair evaluation of athletic performance, has gained increasing attention in recent years. However, athletes often move rapidly and the corresponding variations in visual appearance are subtle, making it challenging to capture fine-grained pose differences and leading to poor estimation performance. Furthermore, most common AQA tasks, such as diving, are divided into multiple sub-actions, each with a different duration. Existing methods, however, focus on segmenting the video into fixed frames, which disrupts the temporal continuity of sub-actions and results in unavoidable prediction errors. To address these challenges, we propose a novel action quality assessment method based on hierarchical pose-guided multi-stage contrastive regression. First, we introduce a multi-scale dynamic visual-skeleton encoder to capture fine-grained spatio-temporal visual and skeletal features. Compared with mask or auxiliary visual features, skeletal features provide a more accurate representation of athletic movements. Then, a procedure segmentation network separates the different sub-actions and produces segmented features. Afterwards, the segmented visual and skeletal features are both fed into a multi-modal fusion module as physical structural priors to guide the model in learning refined activity similarities and variances. Finally, a multi-stage contrastive learning regression approach is employed to learn discriminative representations and output prediction results. In addition, we introduce a newly annotated FineDiving-Pose dataset to improve on the current low-quality human pose labels. Experimental results on the FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of our proposed approach. Our source code and dataset are available at https://github.com/Lumos0507/HP-MCoRe.
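
As a rough illustration of the general idea behind contrastive regression (not the authors' implementation, which is in the repository above), the following minimal sketch assumes that a model predicts the score difference between a query video and exemplar videos with known scores, and that the final prediction averages the exemplar score plus the predicted difference. The names `predict_score` and `toy_relative_score`, and the toy feature vectors, are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

Features = Sequence[float]


def predict_score(
    query: Features,
    exemplars: Sequence[Tuple[Features, float]],
    relative_score: Callable[[Features, Features], float],
) -> float:
    """Average (exemplar score + predicted score difference) over all exemplars."""
    return mean(
        score + relative_score(query, feats) for feats, score in exemplars
    )


def toy_relative_score(query: Features, exemplar: Features) -> float:
    """Hypothetical stand-in for a learned regressor: sum of feature differences."""
    return sum(q - e for q, e in zip(query, exemplar))


# Two exemplar videos with known judge scores (made-up toy values).
exemplars = [([1.0, 2.0], 80.0), ([2.0, 2.0], 85.0)]
query = [1.5, 2.0]
predicted = predict_score(query, exemplars, toy_relative_score)
```

In a multi-stage variant, one such comparison could plausibly be made per segmented sub-action and the results combined; how the stages are actually combined is specified in the paper and is not reproduced here.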
