IEEE Trans Image Process. 2021;30:5559-5572. doi: 10.1109/TIP.2021.3086082. Epub 2021 Jun 16.
Close your eyes and listen to music, and you can easily imagine an actor dancing rhythmically along with it. These imagined movements are usually composed of dance movements you have seen before. In this paper, we propose to reproduce this inherent human capability within a computer vision system. The proposed system consists of three modules. To explore the relationship between music and dance movements, we propose a cross-modal alignment module that learns, from dancing video clips accompanied by pre-designed music, to judge the consistency between the visual features of pose sequences and the acoustic features of music. The learned model is then used in the imagination module to select a pose sequence for the given music. However, a pose sequence selected in this way is usually discontinuous. To solve this problem, in the spatial-temporal alignment module we develop a spatial alignment algorithm, based on the tendency and periodicity of dance movements, that predicts the movements between discontinuous fragments. In addition, the selected pose sequence is often misaligned with the music beat. To address this, we further develop a temporal alignment algorithm that aligns the rhythm of the dance with that of the music. Finally, the processed pose sequence is used in the imagination module to synthesize realistic dancing videos. The generated videos match both the content and the rhythm of the music. Experimental results and subjective evaluations show that the proposed approach can generate promising dancing videos from input music.
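The abstract does not specify how the temporal alignment algorithm works. One common way to align dance rhythm with music beats is a piecewise-linear time warp, sketched below under stated assumptions: motion beats (frame indices in the pose sequence) have already been detected and paired one-to-one with music beats (frame indices in the output video), and each pose is a flat list of joint coordinates. The names `align_to_beats` and `_interp` are hypothetical and are not the authors' implementation.

```python
from bisect import bisect_right
from typing import List, Sequence

Pose = List[float]  # hypothetical: a flat vector of joint coordinates


def _interp(x: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """Piecewise-linear y(x) through points (xs[i], ys[i]); xs strictly increasing.
    Outside [xs[0], xs[-1]] the first/last segment is extended linearly."""
    i = bisect_right(xs, x) - 1
    i = max(0, min(i, len(xs) - 2))  # clamp to a valid segment index
    x0, x1, y0, y1 = xs[i], xs[i + 1], ys[i], ys[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def align_to_beats(poses: Sequence[Pose], motion_beats: Sequence[float],
                   music_beats: Sequence[float], n_out: int) -> List[Pose]:
    """Resample `poses` so that motion_beats[k] lands on music_beats[k].

    Assumption: beats are already detected and matched one-to-one.
    For every output frame f (music timeline) we find the source time in the
    pose timeline by inverse piecewise-linear warping, then blend the two
    nearest source poses linearly."""
    if len(motion_beats) != len(music_beats) or len(motion_beats) < 2:
        raise ValueError("need at least two matched beat pairs")
    last = len(poses) - 1
    out: List[Pose] = []
    for f in range(n_out):
        src = _interp(float(f), music_beats, motion_beats)
        src = max(0.0, min(src, float(last)))  # stay inside the pose sequence
        lo = int(src)                          # floor, since src >= 0
        hi = min(lo + 1, last)
        w = src - lo                           # blend weight toward `hi`
        out.append([(1 - w) * a + w * b for a, b in zip(poses[lo], poses[hi])])
    return out
```

For example, with a one-dimensional "pose" equal to its frame index, `align_to_beats([[float(i)] for i in range(11)], [0, 4, 10], [0, 8, 12], 13)` stretches the first motion segment to twice its length and compresses the second, so output frame 8 holds source pose 4 and output frame 12 holds source pose 10. Linear warping keeps the motion continuous between beats; a smoother warp (e.g. spline-based) would be a reasonable alternative design.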