Xu Xiuwei, Wang Ziwei, Zhou Jie, Lu Jiwen
IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1165-1180. doi: 10.1109/TPAMI.2023.3328880. Epub 2024 Jan 8.
In this paper, we propose a weakly-supervised approach to 3D object detection that makes it possible to train a strong 3D detector with position-level annotations (i.e., annotations of object centers and categories). To remedy the information loss from box annotations to centers, our method uses synthetic 3D shapes to convert the position-level annotations into virtual scenes with box-level annotations, and in turn exploits the fully-annotated virtual scenes to complement the real labels. Specifically, we first present a shape-guided label-enhancement method, which assembles 3D shapes into physically reasonable virtual scenes according to the coarse scene layout extracted from the position-level annotations. We then transfer the information contained in the virtual scenes back to the real ones with a virtual-to-real domain adaptation method, which refines the annotated object centers and additionally supervises the training of the detector with the virtual scenes. Since the shape-guided label-enhancement method generates virtual scenes from hand-crafted physical constraints, the layouts of the resulting fixed virtual scenes may be implausible for some object combinations. To address this, we further present differentiable label enhancement, which optimizes the virtual scenes, including object scales, orientations, and locations, in a data-driven manner. Moreover, we propose a label-assisted self-training strategy to fully exploit the capability of the detector: by reusing the position-level annotations and the virtual scenes, we fuse information from both domains and generate box-level pseudo labels on the real scenes, which allows us to train a detector directly in a fully-supervised manner.
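The core idea behind differentiable label enhancement, treating virtual-scene object parameters as optimizable variables refined by gradient descent on a layout objective, can be sketched as follows. This is an illustrative toy, not the paper's implementation: it optimizes only 2D object centers under two assumed loss terms (stay near the annotated centers; keep objects a minimum distance apart as a stand-in for physical-plausibility constraints), with hand-derived gradients.

```python
# Toy sketch of data-driven layout refinement (all names and the loss
# form are assumptions for illustration; the paper optimizes scales,
# orientations, and locations of full 3D scenes).

def layout_loss_grad(centers, anchors, margin=1.0, lam=1.0):
    """Return the layout loss and its analytic gradient w.r.t. each 2D center."""
    n = len(centers)
    grads = [[0.0, 0.0] for _ in range(n)]
    loss = 0.0
    # Attraction: squared distance to the annotated (position-level) center.
    for i in range(n):
        dx = centers[i][0] - anchors[i][0]
        dy = centers[i][1] - anchors[i][1]
        loss += dx * dx + dy * dy
        grads[i][0] += 2 * dx
        grads[i][1] += 2 * dy
    # Repulsion: hinge penalty on pairs closer than `margin` (objects
    # should not interpenetrate in a physically reasonable layout).
    for i in range(n):
        for j in range(i + 1, n):
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            d = (dx * dx + dy * dy) ** 0.5
            if 1e-9 < d < margin:
                loss += lam * (margin - d) ** 2
                g = -2 * lam * (margin - d) / d  # chain rule through d
                grads[i][0] += g * dx
                grads[i][1] += g * dy
                grads[j][0] -= g * dx
                grads[j][1] -= g * dy
    return loss, grads

def refine_centers(anchors, steps=200, lr=0.05):
    """Gradient-descend the layout loss starting from the annotated centers."""
    centers = [list(a) for a in anchors]
    for _ in range(steps):
        _, grads = layout_loss_grad(centers, anchors)
        for c, g in zip(centers, grads):
            c[0] -= lr * g[0]
            c[1] -= lr * g[1]
    return centers

# Two annotated centers placed implausibly close: optimization pushes
# them apart toward the margin while keeping each near its annotation.
refined = refine_centers([(0.0, 0.0), (0.4, 0.0)])
```

In the example, the two centers settle where the attraction and repulsion gradients balance, a small-scale analogue of refining an implausible virtual layout in a data-driven rather than hand-crafted way.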
Extensive experiments on the widely used ScanNet and Matterport3D datasets show that our approach surpasses current weakly-supervised and semi-supervised methods by a large margin, and achieves detection performance comparable to some popular fully-supervised methods with less than 5% of the labeling effort.
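The pseudo-label generation step of the label-assisted self-training strategy, fusing position-level real annotations with box-level information from the virtual scenes, can be sketched as below. The interface and the nearest-neighbor matching rule are assumptions for illustration (orientation is omitted): each annotated real center borrows the box size of the nearest same-category virtual object.

```python
# Illustrative sketch (assumed interface, not the paper's code) of fusing
# the two domains into box-level pseudo labels on the real scenes.

def make_pseudo_labels(real_anns, virtual_boxes):
    """real_anns: [(center_xyz, category)] position-level labels.
    virtual_boxes: [(center_xyz, size_xyz, category)] from virtual scenes.
    Returns box-level pseudo labels [(center_xyz, size_xyz, category)]."""
    pseudo = []
    for center, cat in real_anns:
        # Candidate virtual objects of the same category.
        cands = [(c, s) for c, s, k in virtual_boxes if k == cat]
        if not cands:
            continue  # no virtual evidence for this category
        # The nearest virtual object (by center distance) supplies the size.
        def sq_dist(vc):
            return sum((a - b) ** 2 for a, b in zip(center, vc))
        _, size = min(cands, key=lambda cs: sq_dist(cs[0]))
        pseudo.append((center, size, cat))
    return pseudo

real = [((1.0, 0.0, 0.5), "chair"), ((3.0, 1.0, 0.4), "table")]
virtual = [((1.1, 0.1, 0.5), (0.6, 0.6, 1.0), "chair"),
           ((2.9, 1.2, 0.4), (1.5, 0.9, 0.8), "table")]
labels = make_pseudo_labels(real, virtual)
```

The resulting box-level pseudo labels keep the (refined) real centers and categories while taking extents from the virtual scenes, which is what allows a detector to then be trained directly in a fully-supervised manner.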