Wang Yifan, Gao Ang, Gong Yi, Zeng Yuan
IEEE Trans Vis Comput Graph. 2025 Oct;31(10):7876-7889. doi: 10.1109/TVCG.2025.3558468.
3D scene stylization refers to generating stylized images of the scene at arbitrary novel view angles following a given set of style images while ensuring consistency when rendered from different views. Recently, several 3D style transfer methods leveraging the scene reconstruction capabilities of pre-trained neural radiance fields (NeRF) have been proposed. To successfully stylize a scene this way, one must first reconstruct a photo-realistic radiance field from collected images of the scene. However, when only sparse input views are available, pre-trained few-shot NeRFs often suffer from high-frequency artifacts, which are generated as a by-product of high-frequency details for improving reconstruction quality. Is it possible to generate more faithful stylized scenes from sparse inputs by directly optimizing encoding-based scene representation with target style? In this paper, we consider the stylization of sparse-view scenes in terms of disentangling content semantics and style textures. We propose a coarse-to-fine sparse-view scene stylization framework, where a novel hierarchical encoding-based neural representation is designed to generate high-quality stylized scenes directly from implicit scene representations. We also propose a new optimization strategy with content strength annealing to achieve realistic stylization and better content preservation. Extensive experiments demonstrate that our method can achieve high-quality stylization of sparse-view scenes and outperforms fine-tuning-based baselines in terms of stylization quality and efficiency.