Tang Junshu, Zhang Bo, Yang Binxin, Zhang Ting, Chen Dong, Ma Lizhuang, Wen Fang
IEEE Trans Vis Comput Graph. 2024 Sep;30(9):6020-6037. doi: 10.1109/TVCG.2023.3323578. Epub 2024 Jul 31.
In contrast to the traditional avatar creation pipeline, which is costly, contemporary generative approaches learn the data distribution directly from photographs. While many works extend unconditional generative models and achieve some level of controllability, ensuring multi-view consistency remains challenging, especially under large poses. In this work, we propose a network that generates 3D-aware portraits controllable by semantic parameters for pose, identity, expression, and illumination. Our network uses a neural scene representation to model 3D-aware portraits, whose generation is guided by a parametric face model that supports explicit control. Although latent disentanglement can be further enhanced by contrasting images with partially different attributes, noticeable inconsistency still appears in non-face areas when animating expressions. We address this with a volume blending strategy that forms a composite output by blending dynamic and static areas, the two parts being segmented from a jointly learned semantic field. Our method outperforms prior art in extensive experiments, producing realistic portraits with vivid expressions under natural lighting when viewed from free viewpoints. It also generalizes to real images as well as out-of-domain data, showing great promise for real applications.
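The volume blending idea in the abstract can be pictured with a minimal PyTorch sketch. This is not the authors' implementation: the function name `blend_fields`, the sigmoid soft weighting, and the density-weighted color mix are hypothetical stand-ins for how a jointly learned semantic field might gate per-sample contributions from a dynamic (expression-driven) branch and a static branch before standard volume rendering.

```python
# Hypothetical sketch of blending dynamic and static radiance fields with
# a learned semantic field; all names and design choices are assumptions.
import torch

def blend_fields(sigma_dyn, rgb_dyn, sigma_sta, rgb_sta, sem_logits):
    """Composite a dynamic (face) field and a static (non-face) field.

    sigma_*:    (N,) per-sample densities from each branch.
    rgb_*:      (N, 3) per-sample colors from each branch.
    sem_logits: (N,) logits from the semantic field scoring how likely
                each sample belongs to the dynamic (face) region.
    """
    w = torch.sigmoid(sem_logits)  # soft dynamic/static segmentation weight
    sigma = w * sigma_dyn + (1.0 - w) * sigma_sta
    # Density-weighted color blend so the branch that dominates opacity
    # at a sample also dominates its color contribution.
    num = (w * sigma_dyn).unsqueeze(-1) * rgb_dyn \
        + ((1.0 - w) * sigma_sta).unsqueeze(-1) * rgb_sta
    rgb = num / (sigma.unsqueeze(-1) + 1e-8)
    return sigma, rgb  # fed to standard volume rendering along each ray
```

Under this reading, animating expressions only perturbs the dynamic branch, while the static branch keeps non-face areas (hair, torso, background) fixed, which is what suppresses the inconsistency the abstract describes.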