Nie Weizhi, Chen Ruidong, Wang Weijie, Lepri Bruno, Sebe Nicu
IEEE Trans Pattern Anal Mach Intell. 2025 Jan;47(1):172-189. doi: 10.1109/TPAMI.2024.3463753. Epub 2024 Dec 4.
In recent years, 3D models have been used in many applications, such as autonomous driving, 3D reconstruction, VR, and AR. However, 3D model data remain too scarce to meet practical demand. Generating high-quality 3D models efficiently from textual descriptions is therefore a promising but challenging way to address this problem. In this paper, we take inspiration from the creative mechanism of human imagination, which draws on experiential knowledge to fill in the concrete details of a target from an ambiguous description, and propose a novel text-3D generation model (T2TD). T2TD generates the target model from a textual description with the aid of experiential knowledge, and its generation process simulates this imaginative mechanism. First, we introduce a text-3D knowledge graph that preserves the relationships between 3D models and textual semantic information and supplies related shapes, playing the role of human experiential information. Second, we propose a causal inference model that selects useful feature information from these related shapes, removing unrelated structural information and retaining only the features strongly related to the textual description. Third, we adopt a novel multi-layer transformer structure that progressively fuses this strongly related structural information with the textual information, compensating for the lack of structural information and improving the final performance of the 3D generation model. Experimental results demonstrate that our approach significantly improves 3D model generation quality and outperforms state-of-the-art methods on the Text2Shape datasets.