Zheng Yu, Duan Yueqi, Li Zongtai, Zhou Jie, Lu Jiwen
IEEE Trans Pattern Anal Mach Intell. 2024 May;46(5):2981-2996. doi: 10.1109/TPAMI.2023.3336874. Epub 2024 Apr 3.
In this paper, we propose a dynamic 3D object detector named HyperDet3D, which is adaptively adjusted on the fly based on hyper scene-level knowledge. Existing methods strive for object-level representations of local elements and their relations without scene-level priors, and thus suffer from ambiguity between similarly structured objects when relying only on an understanding of individual points and object candidates. Instead, we design scene-conditioned hypernetworks to simultaneously learn scene-agnostic embeddings, which exploit sharable abstracts across various 3D scenes, and scene-specific knowledge, which adapts the 3D detector to the given scene at test time. As a result, the lower-level ambiguity in object representations can be resolved by the hierarchical context in scene priors. However, since the upstream hypernetwork in HyperDet3D takes raw scenes as input, which contain noise and redundancy, it produces sub-optimal parameters for the 3D detector when constrained only by downstream detection losses. Based on the fact that the downstream 3D detection task can be factorized into object-level semantic classification and bounding box regression, we further propose HyperFormer3D by correspondingly designing scene-level prior tasks for them in the upstream hypernetworks, namely Semantic Occurrence and Objectness Localization. To this end, we design a transformer-based hypernetwork that translates the task-oriented scene priors into parameters of the downstream detector, avoiding the noise and redundancy of raw scenes. Extensive experimental results on the ScanNet, SUN RGB-D and MatterPort3D datasets demonstrate the effectiveness of the proposed methods.
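The core mechanism described above, a hypernetwork that maps scene-level knowledge to the parameters of a downstream detector, can be sketched in a minimal form. This is an illustrative NumPy toy, not the paper's architecture: the dimensions, the linear hypernetwork, and the single generated layer are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (illustrative assumptions, not from the paper):
# scene embedding dim, per-object feature dim, detector-layer output dim.
D_SCENE, D_IN, D_OUT = 32, 64, 16

# Upstream hypernetwork: here just a fixed linear map from a scene
# embedding to the flattened weights of one downstream detector layer.
W_hyper = rng.standard_normal((D_SCENE, D_IN * D_OUT)) * 0.01

def generate_layer_weights(scene_embedding):
    """Translate scene-level priors into parameters of a detector layer."""
    flat = scene_embedding @ W_hyper       # shape (D_IN * D_OUT,)
    return flat.reshape(D_IN, D_OUT)       # per-scene weight matrix

def detect(object_features, scene_embedding):
    """Apply the scene-conditioned layer to per-object features."""
    W = generate_layer_weights(scene_embedding)
    return object_features @ W             # shape (num_objects, D_OUT)

# Two different scenes yield two different detector parameterizations,
# so the same object candidates are processed scene-adaptively.
scene_a = rng.standard_normal(D_SCENE)
scene_b = rng.standard_normal(D_SCENE)
objects = rng.standard_normal((5, D_IN))   # 5 candidate objects

out_a = detect(objects, scene_a)
out_b = detect(objects, scene_b)
print(out_a.shape)                          # (5, 16)
print(np.allclose(out_a, out_b))            # False: output adapts per scene
```

In the paper's full method, the linear map above is replaced by a transformer-based hypernetwork trained with the scene-level prior tasks (Semantic Occurrence and Objectness Localization), and the generated parameters condition the 3D detection heads rather than a single toy layer.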