Ding Runyu, Yang Jihan, Xue Chuhui, Zhang Wenqing, Bai Song, Qi Xiaojuan
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):8517-8533. doi: 10.1109/TPAMI.2024.3410324. Epub 2024 Nov 6.
Open-world instance-level scene understanding aims to locate and recognize unseen object categories that are not present in the annotated dataset. This task is challenging because the model needs to both localize novel 3D objects and infer their semantic categories. A key factor behind recent progress in 2D open-world perception is the availability of large-scale image-text pairs from the Internet, which cover a wide range of vocabulary concepts. However, this success is hard to replicate in 3D scenarios due to the scarcity of 3D-text pairs. To address this challenge, we propose to harness pre-trained vision-language (VL) foundation models, which encode extensive knowledge from image-text pairs, to generate captions for multi-view images of 3D scenes. This allows us to establish explicit associations between 3D shapes and semantically rich captions. Moreover, to enhance fine-grained visual-semantic representation learning from captions for object-level categorization, we design hierarchical point-caption association methods that learn semantic-aware embeddings by exploiting the geometric relationships between 3D points and multi-view images. In addition, to tackle the localization challenge for novel classes in the open-world setting, we develop debiased instance localization, which trains object grouping modules on unlabeled data using instance-level pseudo supervision. This significantly improves the generalization of instance grouping and, thus, the ability to accurately locate novel objects. We conduct extensive experiments on 3D semantic, instance, and panoptic segmentation tasks, covering indoor and outdoor scenes across three datasets. Our method outperforms baseline methods by a significant margin in semantic segmentation (e.g., 34.5% ∼ 65.3%), instance segmentation (e.g., 21.8% ∼ 54.0%), and panoptic segmentation (e.g., 14.7% ∼ 43.3%). Code will be available.
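To make the point-caption association idea concrete, the sketch below (our own illustration, not the authors' released code) shows only the view-level case: per-point features from a 3D backbone are pooled over the points that project into each captioned image and aligned with that caption's text embedding through an InfoNCE-style contrastive loss. Tensor names, shapes, and the pooling choice (point_feats, view_point_masks, caption_embeds, mean pooling) are assumptions for illustration; the full method additionally uses scene- and entity-level associations and a separate debiased instance localization module.

    # Minimal sketch of view-level point-caption contrastive association.
    # Assumed inputs (hypothetical names/shapes, not from the paper's code):
    #   point_feats:      (N, D) per-point features from a 3D backbone
    #   view_point_masks: (V, N) boolean masks; row v marks points visible in view v
    #                     (obtained from 3D-to-2D projection of the scene)
    #   caption_embeds:   (V, D) text embeddings of the V view captions
    #                     (e.g., from a frozen VL text encoder)
    import torch
    import torch.nn.functional as F

    def point_caption_contrastive_loss(point_feats, view_point_masks,
                                       caption_embeds, temperature=0.07):
        # Mean-pool point features per captioned view -> one region embedding each.
        masks = view_point_masks.float()
        pooled = masks @ point_feats / masks.sum(dim=1, keepdim=True).clamp(min=1.0)  # (V, D)

        pooled = F.normalize(pooled, dim=-1)
        captions = F.normalize(caption_embeds, dim=-1)

        # Symmetric InfoNCE: each pooled region should match its own caption.
        logits = pooled @ captions.t() / temperature          # (V, V) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Usage with random tensors (shapes only for illustration):
    # loss = point_caption_contrastive_loss(torch.randn(5000, 512),
    #                                       torch.rand(8, 5000) > 0.7,
    #                                       torch.randn(8, 512))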