

Visual language transformer framework for multimodal dance performance evaluation and progression monitoring.

Author information

Chen Lei

Affiliation

Art College, Chengdu Sport University, Chengdu, 610041, China.

Publication information

Sci Rep. 2025 Aug 20;15(1):30649. doi: 10.1038/s41598-025-16345-2.

Abstract

Dance is often perceived as complex due to the need to coordinate multiple body movements and precisely align them with musical rhythm and content. Research in automatic dance performance assessment has the potential to enhance individuals' sensorimotor skills and motion analysis. Recent studies on dance performance assessment primarily focus on evaluating simple dance movements with a single task, typically estimating a final performance score. We propose a novel transformer-based visual-language framework for multi-modal dance performance evaluation and progression monitoring. Our approach addresses two core challenges: learning feature representations for complex dance movements synchronized with music across diverse styles, genres, and expertise levels, and capturing the multi-task nature of dance performance evaluation. To achieve this, we integrate contrastive self-supervised learning, spatiotemporal graph convolutional networks (STGCN), long short-term memory networks (LSTM), and transformer-based text prompting. Our model evaluates three key tasks: (i) multilabel dance classification, (ii) dance quality estimation, and (iii) dance-music synchronization, leveraging primitive-based segmentation and multi-modal inputs. During the pre-training phase, we use a contrastive loss to capture primitive-based features from complex dance motion and music data. For downstream tasks, we propose a transformer-based text prompting approach to conduct multi-task evaluation across the three assessment objectives. Our model outperforms the baseline across diverse downstream tasks. For multilabel dance classification, it achieves a score of 75.20, a 10.25% improvement over CotrastiveDance; on dance quality estimation, it achieves a 92.09% lower loss than CotrastiveDance; and on dance-music synchronization, it scores 2.52, outperforming CotrastiveDance by 48.67%.
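The contrastive pre-training the abstract describes, pairing dance-motion segments with their matching music segments, can be sketched as a symmetric InfoNCE-style loss over a batch of matched embedding pairs. This is an illustrative sketch only: the function names, the temperature value, and the use of cosine similarity are assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(motion_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE loss: motion_emb[i] and music_emb[i] are a
    matched pair; all other pairings in the batch act as negatives."""
    n = len(motion_emb)
    # Temperature-scaled similarity matrix between all motion/music pairs.
    sims = [[cosine(m, a) / temperature for a in music_emb] for m in motion_emb]
    loss = 0.0
    for i in range(n):
        # Motion-to-music direction: the i-th music clip is the positive.
        row = sims[i]
        loss += -row[i] + math.log(sum(math.exp(s) for s in row))
        # Music-to-motion direction: the i-th motion clip is the positive.
        col = [sims[j][i] for j in range(n)]
        loss += -col[i] + math.log(sum(math.exp(s) for s in col))
    return loss / (2 * n)
```

Minimizing this loss pulls each motion segment's embedding toward its synchronized music segment and pushes it away from the other music segments in the batch, which is the standard mechanism for learning aligned cross-modal representations.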


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/edd3/12368089/9a08c5e19388/41598_2025_16345_Fig1_HTML.jpg
