IEEE Trans Vis Comput Graph. 2021 Oct;27(10):4009-4022. doi: 10.1109/TVCG.2020.2996594. Epub 2021 Sep 1.
Synthesizing realistic videos of humans using neural networks has become a popular alternative to the conventional graphics-based rendering pipeline due to its high efficiency. Existing works typically formulate this as an image-to-image translation problem in 2D screen space, which leads to artifacts such as over-smoothing, missing body parts, and temporal instability of fine-scale detail such as pose-dependent wrinkles in the clothing. In this article, we propose a novel human video synthesis method that addresses these limiting factors by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space. More specifically, our method relies on the combination of two convolutional neural networks (CNNs). Given the pose information, the first CNN predicts a dynamic texture map that contains time-coherent high-frequency details, and the second CNN conditions the generation of the final video on this temporally coherent output. We demonstrate several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvements over the state of the art, both qualitatively and quantitatively.
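To make the two-CNN pipeline concrete, the following is a minimal PyTorch sketch, not the authors' architecture: the module names (TexNet, RefineNet), channel counts, and layer choices are illustrative assumptions, and the rendering step that projects the predicted texture onto the posed body model between the two networks is stubbed out with an identity placeholder.

    import torch
    import torch.nn as nn

    def conv_block(c_in, c_out):
        # 3x3 conv + ReLU, the basic unit of both hypothetical networks
        return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU())

    class TexNet(nn.Module):
        """Stage 1 (hypothetical name): predicts a dynamic texture map
        with high-frequency detail from a rasterized pose encoding."""
        def __init__(self, pose_ch=3, tex_ch=3):
            super().__init__()
            self.body = nn.Sequential(conv_block(pose_ch, 64), conv_block(64, 64))
            self.head = nn.Conv2d(64, tex_ch, 3, padding=1)

        def forward(self, pose_map):  # pose_map: (B, pose_ch, H, W)
            return torch.tanh(self.head(self.body(pose_map)))  # texture in [-1, 1]

    class RefineNet(nn.Module):
        """Stage 2 (hypothetical name): synthesizes the final frame,
        conditioned on the rendering of stage 1's texture plus the pose."""
        def __init__(self, pose_ch=3, render_ch=3, out_ch=3):
            super().__init__()
            self.body = nn.Sequential(conv_block(pose_ch + render_ch, 64),
                                      conv_block(64, 64))
            self.head = nn.Conv2d(64, out_ch, 3, padding=1)

        def forward(self, pose_map, rendered):
            x = torch.cat([pose_map, rendered], dim=1)  # channel-wise conditioning
            return torch.tanh(self.head(self.body(x)))

    # Toy forward pass. Projecting the predicted texture onto the posed
    # body model is not shown; an identity placeholder stands in for it.
    pose_map = torch.randn(1, 3, 256, 256)
    texture = TexNet()(pose_map)
    rendered = texture  # placeholder for the texture-rendering step
    frame = RefineNet()(pose_map, rendered)
    print(frame.shape)  # torch.Size([1, 3, 256, 256])

Conditioning the second network on the first network's temporally coherent output, rather than predicting the frame directly from the pose in screen space, is what lets the fine-scale detail stay stable across frames.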