IEEE Trans Pattern Anal Mach Intell. 2022 Aug;44(8):4454-4468. doi: 10.1109/TPAMI.2021.3063611. Epub 2022 Jul 1.
Manually labeling LiDAR point cloud data to train high-quality 3D object detectors is laborious and costly. This work proposes a weakly supervised framework that learns 3D detection from a few weakly annotated examples, achieved by a two-stage architecture design. Stage-1 learns to generate cylindrical object proposals under inaccurate and inexact supervision, obtained by our proposed BEV center-click annotation strategy, in which only the horizontal object centers are click-annotated in bird's-eye-view scenes. Stage-2 learns to predict cuboids and confidence scores in a coarse-to-fine, cascaded manner under incomplete supervision, i.e., only a small portion of object cuboids are precisely annotated. On the KITTI dataset, using only 500 weakly annotated scenes and 534 precisely labeled vehicle instances, our method achieves 86-97 percent of the performance of current top-leading, fully supervised detectors (which require 3,712 exhaustively annotated scenes with 15,654 instances). More importantly, with our elaborately designed network architecture, our trained model can be applied as a 3D object annotator, supporting both automatic and active (human-in-the-loop) working modes. The annotations generated by our model can be used to train 3D object detectors, which achieve over 95 percent of their original performance (obtained with manually labeled training data). Our experiments also show our model's potential for boosting performance when given more training data. These designs make our approach highly practical and open up opportunities for learning 3D detection at reduced annotation cost.