Computer Vision Center (CVC), Universitat Autònoma de Barcelona (UAB), 08193 Bellaterra, Spain.
Computer Science Department, Universitat Autònoma de Barcelona (UAB), 08193 Bellaterra, Spain.
Sensors (Basel). 2021 May 4;21(9):3185. doi: 10.3390/s21093185.
Top-performing computer vision models are powered by convolutional neural networks (CNNs). Training an accurate CNN highly depends on both the raw sensor data and their associated ground truth (GT). Collecting such GT is usually done through human labeling, which is time-consuming and does not scale as desired. This data-labeling bottleneck may be intensified by domain shifts among image sensors, which could force per-sensor data labeling. In this paper, we focus on the use of co-training, a semi-supervised learning (SSL) method, for obtaining self-labeled object bounding boxes (BBs), i.e., the GT to train deep object detectors. In particular, we assess the performance of multi-modal co-training by relying on two different views of an image, namely, appearance (RGB) and estimated depth (D). Moreover, we compare appearance-based single-modal co-training with its multi-modal counterpart. Our results suggest that in a standard SSL setting (no domain shift, a few human-labeled data) and under virtual-to-real domain shift (many virtual-world labeled data, no human-labeled data), multi-modal co-training outperforms single-modal co-training. In the latter case, after performing GAN-based domain translation, both co-training variants perform on par, at least when using an off-the-shelf depth estimation model not specifically trained on the translated images.
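The co-training procedure the abstract refers to can be illustrated with a minimal, self-contained sketch. This is not the paper's detector setup: the two "views" below are toy 1-D stand-ins for the RGB and depth modalities, and the threshold classifier is a deliberately simple placeholder. The core mechanism is the same, though: each view-specific model pseudo-labels the unlabeled samples it is most confident about, and those self-labeled samples extend the *other* model's training set.

```python
import random

random.seed(0)

def make_data(n):
    """Toy two-view data: both views are noisy copies of a latent value x,
    and the label is the sign of x (stand-ins for RGB and depth views)."""
    data = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        views = (x + random.gauss(0, 0.3), x + random.gauss(0, 0.3))
        data.append((views, int(x > 0)))
    return data

class ThresholdModel:
    """Single-view classifier: predicts 1 if the view value exceeds a
    threshold placed midway between the two class means."""
    def __init__(self):
        self.t = 0.0
    def fit(self, pairs):  # pairs: list of (view value, label)
        pos = [v for v, y in pairs if y == 1]
        neg = [v for v, y in pairs if y == 0]
        if pos and neg:
            self.t = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    def predict(self, v):
        return int(v > self.t)
    def confidence(self, v):
        return abs(v - self.t)  # distance from the decision boundary

def cotrain(labeled, unlabeled, rounds=5, k=10):
    """Each round, every model pseudo-labels its k most confident unlabeled
    samples; those self-labeled samples extend the other model's set."""
    m1, m2 = ThresholdModel(), ThresholdModel()
    L1 = [(v[0], y) for v, y in labeled]
    L2 = [(v[1], y) for v, y in labeled]
    pool = list(unlabeled)  # unlabeled samples: (view1, view2) pairs
    for _ in range(rounds):
        m1.fit(L1)
        m2.fit(L2)
        # model 1 labels for model 2
        pool.sort(key=lambda v: -m1.confidence(v[0]))
        take, pool = pool[:k], pool[k:]
        L2.extend((v[1], m1.predict(v[0])) for v in take)
        # model 2 labels for model 1
        pool.sort(key=lambda v: -m2.confidence(v[1]))
        take, pool = pool[:k], pool[k:]
        L1.extend((v[0], m2.predict(v[1])) for v in take)
    m1.fit(L1)
    m2.fit(L2)
    return m1, m2

labeled = make_data(12)                     # a few human-labeled samples
unlabeled = [v for v, _ in make_data(100)]  # labels discarded
m1, m2 = cotrain(labeled, unlabeled)
test = make_data(200)
acc = sum(m1.predict(v[0]) == y for v, y in test) / len(test)
print(f"view-1 model accuracy: {acc:.2f}")
```

In the paper's actual setting, the two models are object detectors, the samples are images, the pseudo-labels are bounding boxes rather than class labels, and confidence comes from detection scores; the loop structure, however, follows this pattern.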