IEEE Trans Pattern Anal Mach Intell. 2020 Feb;42(2):357-370. doi: 10.1109/TPAMI.2018.2876842. Epub 2018 Oct 18.
In this work, we propose a novel model-based deep convolutional autoencoder that addresses the highly challenging problem of reconstructing a 3D human face from a single in-the-wild color image. To this end, we combine a convolutional encoder network with an expert-designed generative model that serves as the decoder. The core innovation is the differentiable parametric decoder that encapsulates image formation analytically based on a generative model. Our decoder takes as input a code vector with exactly defined semantic meaning that encodes detailed face pose, shape, expression, skin reflectance, and scene illumination. Owing to this new way of combining CNN-based and model-based face reconstruction, the CNN-based encoder learns to extract semantically meaningful parameters from a single monocular input image. For the first time, a CNN encoder and an expert-designed generative model can be trained end-to-end in an unsupervised manner, which renders training on very large (unlabeled) real-world datasets feasible. The obtained reconstructions compare favorably to current state-of-the-art approaches in terms of quality and richness of representation. This work is an extended version of [1], in which we additionally present a stochastic vertex sampling technique for faster training of our networks, and we further propose and evaluate analysis-by-synthesis and shape-from-shading refinement approaches to achieve a high-fidelity reconstruction.
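The pipeline described above (a semantic code vector sliced into pose, shape, expression, reflectance, and illumination parameters; a model-based decoder; and stochastic vertex sampling of the training loss) can be sketched in NumPy. This is a minimal illustration under loud assumptions: the dimensions are toy values, the linear bases are random stand-ins for a real 3D morphable face model, and rendering/shading are omitted; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions; the paper uses a full 3D morphable face model).
N_VERTS = 500   # vertices in the face mesh
N_POSE = 6      # rotation + translation
N_SHAPE = 80    # shape (identity) coefficients
N_EXPR = 64     # expression coefficients
N_REFL = 80     # skin reflectance coefficients
N_ILLUM = 27    # spherical-harmonics illumination (3 bands x 3 color channels)

# Hypothetical linear bases standing in for the expert-designed generative model.
mean_shape = rng.normal(size=(N_VERTS, 3))
shape_basis = rng.normal(size=(N_VERTS * 3, N_SHAPE)) * 0.01
expr_basis = rng.normal(size=(N_VERTS * 3, N_EXPR)) * 0.01

def split_code(code):
    """Slice the semantic code vector into its named parameter groups."""
    sizes = [N_POSE, N_SHAPE, N_EXPR, N_REFL, N_ILLUM]
    groups, i = [], 0
    for s in sizes:
        groups.append(code[i:i + s])
        i += s
    return groups  # pose, shape, expression, reflectance, illumination

def decode_geometry(alpha, delta):
    """Model-based decoder (geometry part): linear morphable model."""
    offsets = shape_basis @ alpha + expr_basis @ delta
    return mean_shape + offsets.reshape(N_VERTS, 3)

def sampled_vertex_loss(pred, target, n_samples=64):
    """Stochastic vertex sampling: evaluate the reconstruction loss on a
    random vertex subset each iteration to speed up training."""
    idx = rng.choice(N_VERTS, size=n_samples, replace=False)
    return float(np.mean((pred[idx] - target[idx]) ** 2))

# In the full system the code vector is predicted by the CNN encoder;
# here we draw a random one for illustration.
code = rng.normal(size=N_POSE + N_SHAPE + N_EXPR + N_REFL + N_ILLUM)
pose, alpha, delta, beta, gamma = split_code(code)
mesh = decode_geometry(alpha, delta)
loss = sampled_vertex_loss(mesh, mean_shape)
```

Because every operation in the decoder is differentiable with respect to the code vector, gradients of such a loss can flow back through the decoder into the encoder, which is what enables the end-to-end unsupervised training described in the abstract.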