Xie Zhenyu, Zhao Fuwei, Zheng Jun, Dong Xin, Zhu Feida, Liang Xiaodan
IEEE Trans Pattern Anal Mach Intell. 2025 Jul 21;PP. doi: 10.1109/TPAMI.2025.3591072.
We introduce a monocular-to-3D virtual try-on network based on a conditional 3D-aware Generative Adversarial Network (3D-GAN) for synthesizing multi-view try-on results from a single monocular image. In contrast to previous 3D virtual try-on methods that rely on costly scanned meshes or pseudo-depth maps for supervision, our approach uses a conditional 3D-GAN trained solely on 2D images, greatly simplifying dataset construction and improving model scalability. Specifically, we propose a Generative monocular-to-3D Virtual Try-ON network (G3D-VTON) that integrates a 3D-aware conditional Parsing Module (3DPM), a U-Net Refinement Module (URM), and a Flow-based 2D Virtual Try-On Module (FTM). In our framework, the 3DPM generates a 3D representation of the virtual try-on result, thereby enabling multi-view rendering. To accomplish this, it is implemented with conditional generative semantic articulated fields, which leverage the 3D SMPL prior via inverse skinning to learn the Signed Distance Function (SDF) of the try-on result in a canonical pose space. The learned SDF enables the rendering of both a coarse human parsing map and a preliminary try-on output under explicit camera control. Furthermore, within the 3DPM, we introduce deferred pose guidance to decouple style and pose conditions during training, thereby facilitating view-controllable generation during inference. However, the rendered human parsing and try-on results exhibit imprecise shapes and blurry textures. To address these issues, the URM refines the rendered outputs with a refinement U-Net, and the FTM fuses the refined results with the 2D warped garment to produce the final try-on output with more accurate and realistic appearance details.
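The canonical-space SDF described above is queried by warping points from posed space back to the canonical pose via inverse skinning with the SMPL prior. The following is a minimal sketch of that inverse linear-blend-skinning step only; the function name, array shapes, and the use of NumPy are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def inverse_lbs(x_posed, joint_transforms, skinning_weights):
    """Map a posed-space 3D point back to canonical space via inverse
    linear blend skinning (illustrative sketch, not the paper's code).

    x_posed          : (3,)  point in posed space
    joint_transforms : (J, 4, 4) per-joint rigid transforms (SMPL-style)
    skinning_weights : (J,)  blend weights for this point, summing to 1
    """
    # Blend the per-joint 4x4 transforms by the skinning weights ...
    T = np.tensordot(skinning_weights, joint_transforms, axes=(0, 0))  # (4, 4)
    # ... then invert the blended transform and apply it to the point.
    x_h = np.append(x_posed, 1.0)          # homogeneous coordinates
    x_canonical = np.linalg.inv(T) @ x_h
    return x_canonical[:3]
```

The canonical-space point returned by such a warp is what the semantic articulated field would evaluate to obtain the SDF (and semantic) value for rendering.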
Extensive experiments demonstrate that the proposed G3D-VTON effectively manipulates and generates faithful 3D human appearances wearing the desired garment, outperforming both 3D-GAN and depth-based 3D approaches while delivering superior visual results in 2D.