IEEE Trans Image Process. 2021;30:9259-9269. doi: 10.1109/TIP.2021.3123549. Epub 2021 Nov 12.
Transferring human motion from a source to a target person holds great potential for computer vision and graphics applications. A crucial step is to manipulate sequential future motion while retaining the target's appearance characteristics. Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person, which is not scalable in practice. This work studies a more general setting, in which we aim to learn a single model, named the Collaborative Parsing-Flow Network (CPF-Net), that transfers motion from a source video to any target person given only one image of that person. The paucity of information about the target person makes it particularly challenging to faithfully preserve the appearance across varying designated poses. To address this issue, CPF-Net integrates structured human parsing and appearance flow to guide realistic foreground synthesis, which is then merged into the background by a spatio-temporal fusion module. In particular, CPF-Net decouples the problem into three stages: human parsing sequence generation, foreground sequence generation, and final video generation. The human parsing generation stage captures both the pose and the body structure of the target, while the appearance flow helps preserve fine details in the synthesized frames; their integration effectively guides the generation of video frames with realistic appearance. Finally, a dedicatedly designed fusion network ensures temporal coherence. We further collect a large set of human dancing videos to push this research field forward. Both quantitative and qualitative results show that our method substantially improves over previous approaches and generates appealing, photo-realistic target videos given any input person image. All source code and the dataset will be released at https://github.com/xiezhy6/CPF-Net.
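The appearance-flow idea summarized above — preserving texture detail by sampling source-image pixels through a dense per-pixel offset field rather than synthesizing them from scratch — can be sketched as follows. This is a minimal NumPy illustration with nearest-neighbour sampling, not the paper's actual implementation; the function name and flow convention are assumptions for demonstration only.

```python
import numpy as np

def warp_with_appearance_flow(src, flow):
    """Warp a source image toward a target pose via an appearance-flow field.

    src:  (H, W, C) source image.
    flow: (H, W, 2) per-pixel offsets (dy, dx); each target-frame pixel
          copies the source pixel at its own coordinates plus the offset.
    Nearest-neighbour sampling with border clamping; a real model would
    use differentiable bilinear sampling instead.
    """
    h, w, _ = src.shape
    ys, xs = np.mgrid[0:h, 0:w]                      # target pixel grid
    sy = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    sx = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return src[sy, sx]                               # gather source pixels

# A zero flow field reproduces the source image unchanged, which is the
# sanity check that texture is copied rather than regenerated.
src = np.arange(6, dtype=np.float32).reshape(2, 3, 1)
assert np.array_equal(warp_with_appearance_flow(src, np.zeros((2, 3, 2))), src)
```

In the full model such a warp would be predicted per frame and combined with the parsing-guided foreground synthesis before the spatio-temporal fusion step merges it into the background.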