IEEE Trans Image Process. 2023;32:3521-3535. doi: 10.1109/TIP.2023.3286708. Epub 2023 Jun 29.
Inspired by active learning and 2D-3D semantic fusion, we propose a novel framework for 3D scene semantic segmentation based on rendered 2D images, which can efficiently segment any large-scale 3D scene with only a few 2D image annotations. In our framework, we first render perspective images at selected positions in the 3D scene. We then iteratively fine-tune a pre-trained image semantic segmentation network and project all of its dense predictions onto the 3D model for fusion. In each iteration, we evaluate the 3D semantic model, re-render images in several representative areas where the 3D segmentation is unstable, and, after annotation, feed them to the network for training. Through this iterative rendering-segmentation-fusion process, the framework effectively generates hard-to-segment image samples in the scene while avoiding complex 3D annotation, thereby achieving label-efficient 3D scene segmentation. Experiments on three large-scale indoor and outdoor 3D datasets demonstrate the effectiveness of the proposed method compared with other state-of-the-art approaches.
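The abstract describes an iterative rendering-segmentation-fusion loop with uncertainty-driven view selection. The Python sketch below illustrates one plausible structure of that loop under our own assumptions; it is not the authors' implementation. The helpers render_views, annotate, project_to_3d, fuse_labels, and find_unstable_regions are hypothetical placeholders for the paper's rendering, annotation, projection/fusion, and stability-evaluation steps, and model stands for an arbitrary pre-trained 2D segmentation network exposing fine_tune and predict.

# Minimal sketch of the rendering-segmentation-fusion loop (assumptions only).
# All helpers below are hypothetical placeholders, not the paper's code.

def label_efficient_3d_segmentation(scene, model, num_iters=5, views_per_iter=20):
    """Iteratively segment a 3D scene from a few annotated rendered 2D views."""
    # 1. Render an initial batch of perspective views at selected camera poses.
    views = render_views(scene, num_views=views_per_iter)
    labeled = [(v, annotate(v)) for v in views]  # sparse 2D annotations

    semantic_3d = None
    for _ in range(num_iters):
        # 2. Fine-tune the pre-trained 2D segmentation network on labeled views.
        model.fine_tune(labeled)

        # 3. Predict dense 2D labels for all rendered views and project them
        #    onto the 3D model, fusing per-point class scores.
        predictions = [(v, model.predict(v)) for v in views]
        semantic_3d = fuse_labels(
            [project_to_3d(scene, v, p) for v, p in predictions]
        )

        # 4. Locate representative regions where the fused 3D labels are
        #    unstable (e.g. views disagree) and re-render images there.
        unstable_regions = find_unstable_regions(semantic_3d)
        new_views = [render_views(scene, region=r, num_views=1)[0]
                     for r in unstable_regions[:views_per_iter]]

        # 5. Annotate only these hard views and add them to the training pool.
        labeled += [(v, annotate(v)) for v in new_views]
        views += new_views

    return semantic_3d

In this reading, the annotation cost stays low because only the re-rendered views from unstable regions are labeled in each round, while the 3D model itself is never annotated directly.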