IEEE Trans Vis Comput Graph. 2020 Dec;26(12):3457-3466. doi: 10.1109/TVCG.2020.3023573. Epub 2020 Nov 10.
Video portraits are common in a variety of applications, such as videoconferencing, news broadcasting, and virtual education and training. We present a novel method to synthesize photorealistic video portraits for an input portrait video, automatically driven by a person's voice. The main challenge in this task is hallucinating plausible, photorealistic facial expressions from input speech audio. To address this challenge, we employ a parametric 3D face model parameterized by geometry, facial expression, illumination, and other factors, and learn a mapping from audio features to model parameters. The input source audio is first represented as a high-dimensional feature, which is used to predict the facial expression parameters of the 3D face model. We then replace the expression parameters computed from the original target video with the predicted ones and re-render the reenacted face. Finally, we generate a photorealistic video portrait from the reenacted synthetic face sequence via a neural face renderer. One appealing feature of our approach is its generalization capability across various input speech audio, including synthetic speech audio from text-to-speech software. Extensive experimental results show that our approach outperforms previous general-purpose audio-driven video portrait methods. This includes a user study demonstrating that our results are rated as more realistic than those of previous methods.
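The pipeline described above (predict expression parameters from audio features, then swap them into the target video's face-model parameters) can be sketched as follows. This is a minimal illustrative stand-in, not the paper's implementation: the dimensions, the linear mapping `predict_expressions`, and the parameter names (`geometry`, `expression`, `illumination`) are all hypothetical placeholders for the learned audio-to-expression network and the parametric 3D face model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 256-d audio feature per frame and a
# 64-d expression coefficient vector of a parametric 3D face model.
AUDIO_DIM, EXPR_DIM, N_FRAMES = 256, 64, 10


def predict_expressions(audio_features, weights, bias):
    """Linear stand-in for the learned audio-to-expression mapping."""
    return audio_features @ weights + bias


def reenact(target_params, predicted_expr):
    """Replace the target video's expression parameters with the
    predicted ones, keeping geometry and illumination untouched."""
    out = dict(target_params)
    out["expression"] = predicted_expr
    return out


# Per-frame source audio features and (placeholder) mapping weights.
audio = rng.standard_normal((N_FRAMES, AUDIO_DIM))
W = rng.standard_normal((AUDIO_DIM, EXPR_DIM)) * 0.01
b = np.zeros(EXPR_DIM)

# Parameters estimated from the original target video (placeholder sizes).
target = {
    "geometry": rng.standard_normal((N_FRAMES, 80)),
    "expression": rng.standard_normal((N_FRAMES, EXPR_DIM)),
    "illumination": rng.standard_normal((N_FRAMES, 27)),
}

predicted = predict_expressions(audio, W, b)
reenacted = reenact(target, predicted)
```

The reenacted parameter set would then drive a renderer of the 3D face model, whose output a neural face renderer converts into photorealistic frames; those last two stages are outside the scope of this sketch.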