Liu Kangcheng, Liu Yong-Jin, Chen Baoquan
IEEE Trans Pattern Anal Mach Intell. 2025 Sep;47(9):7352-7368. doi: 10.1109/TPAMI.2025.3566593.
Prevailing vision-language models have achieved remarkable progress in 3D scene understanding, but they are trained in a closed-set setting with full labels. The major bottleneck for current 3D scene recognition approaches in robotic applications is that these models cannot recognize unseen novel classes beyond the training categories, which limits their use in diverse real-world tasks such as robot manipulation and navigation. Meanwhile, current state-of-the-art 3D scene understanding approaches require large numbers of high-quality labels to train neural networks and perform well only in a fully supervised manner. There is therefore an urgent need for a framework that applies to both 3D point cloud segmentation and detection, particularly when labels are scarce. This work presents a general and straightforward framework for 3D scene understanding when labeled scenes are limited. To extract knowledge for novel categories, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy that distills meaningful information from large-scale pre-trained vision-language models, benefiting open-vocabulary scene understanding tasks. To leverage boundary information, we propose a novel boundary-aware energy-based loss that benefits from region-level boundary predictions. To encourage latent instance discrimination and ensure efficiency, we propose an unsupervised region-level semantic contrastive learning scheme for point clouds that uses the network's confident predictions to discriminate intermediate feature embeddings at multiple stages. In the limited-reconstruction case, our proposed approach, termed WS3D++, ranks first on the large-scale ScanNet benchmark for both semantic segmentation and instance segmentation. WS3D++ also achieves state-of-the-art data-efficient learning performance on the other large-scale real-scene indoor and outdoor datasets, S3DIS and SemanticKITTI. Extensive experiments on both indoor and outdoor scenes demonstrate the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.
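To make the feature-aligned distillation idea concrete, below is a minimal PyTorch sketch of multi-stage alignment between 3D point features and vision-language embeddings. It assumes per-point 2D vision-language features (e.g., from a CLIP-style model) have already been projected onto the point cloud; all function and tensor names are illustrative assumptions, not the authors' released implementation.

import torch.nn.functional as F

def hierarchical_distill_loss(point_feats_per_stage, vl_feats, proj_heads):
    """Align 3D features with vision-language embeddings at multiple stages.

    point_feats_per_stage: list of (N, C_s) point features from decoder stages
    vl_feats:              (N, D) pre-computed vision-language features per point
    proj_heads:            list of nn.Linear(C_s, D) projections, one per stage
    """
    vl = F.normalize(vl_feats, dim=-1)
    loss = 0.0
    for feats, head in zip(point_feats_per_stage, proj_heads):
        f3d = F.normalize(head(feats), dim=-1)            # project into VL space
        loss = loss + (1.0 - (f3d * vl).sum(-1)).mean()   # cosine distance per point
    return loss / len(point_feats_per_stage)

Averaging a cosine-distance term over several decoder stages is one simple way to realize "hierarchical" alignment; the paper's exact stage selection and weighting may differ.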
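Similarly, a possible reading of the unsupervised region-level semantic contrastive scheme is a supervised-contrastive (InfoNCE-style) loss over region embeddings, where pseudo-labels come from the network's own confident predictions. This is a sketch under that assumption; thresholds, names, and the exact loss form are hypothetical.

import torch
import torch.nn.functional as F

def region_contrastive_loss(region_embs, pseudo_labels, confidence,
                            conf_thresh=0.9, temperature=0.07):
    """Contrast region embeddings using confident pseudo-labels.

    region_embs:   (R, D) pooled embeddings, one per over-segmented region
    pseudo_labels: (R,)  argmax class prediction per region
    confidence:    (R,)  max softmax probability per region
    """
    keep = confidence > conf_thresh                  # discard uncertain regions
    embs = F.normalize(region_embs[keep], dim=-1)
    labels = pseudo_labels[keep]
    sim = embs @ embs.t() / temperature              # pairwise similarity logits
    pos_mask = (labels[:, None] == labels[None, :]).float()
    pos_mask.fill_diagonal_(0)                       # exclude self-pairs
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # numerical stability
    self_mask = 1.0 - torch.eye(len(embs), device=embs.device)
    exp_logits = torch.exp(logits) * self_mask
    log_prob = logits - torch.log(exp_logits.sum(1, keepdim=True) + 1e-8)
    mean_log_prob_pos = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    valid = pos_mask.sum(1) > 0                      # regions with at least one positive
    return -mean_log_prob_pos[valid].mean()

Applying this loss to embeddings taken from multiple intermediate stages of the backbone would correspond to the multi-stage discrimination described in the abstract.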