Liu Jing, Wang Yuhang, Li Yong, Fu Jun, Li Jiangyun, Lu Hanqing
IEEE Trans Neural Netw Learn Syst. 2018 Nov;29(11):5655-5666. doi: 10.1109/TNNLS.2017.2787781. Epub 2018 Mar 20.
Semantic segmentation and single-view depth estimation are two fundamental problems in computer vision. They exploit the semantic and geometric properties of images, respectively, and are thus complementary for scene understanding. In this paper, we propose a collaborative deconvolutional neural network (C-DCNN) that jointly models the two problems so that each promotes the other. The C-DCNN consists of two DCNNs, one dedicated to each task; the DCNNs provide a finer-resolution reconstruction and are pretrained with hierarchical supervision. The feature maps from the two DCNNs are integrated via a pointwise bilinear layer, which fuses the semantic and depth information and produces higher-order features. The integrated features are then fed into two sibling classification layers that learn semantic segmentation and depth estimation simultaneously. In this way, the semantic and depth features are combined in a unified deep network and trained jointly so that the two tasks benefit each other. Specifically, during network training we treat depth estimation as a classification problem: a soft mapping strategy maps the continuous depth values onto discrete probability distributions, and the cross-entropy loss is used. In addition, a fully connected conditional random field is applied as postprocessing to further improve semantic segmentation; it jointly considers the proximity relations of pixels in position, intensity, and depth. We evaluate our approach on two challenging benchmarks, NYU Depth V2 and SUN RGB-D, and demonstrate that it effectively utilizes the two kinds of information, achieving state-of-the-art results on both the semantic segmentation and depth estimation tasks.
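The soft mapping idea can be illustrated with a minimal sketch: discretize the depth range into a set of bin centers and split the probability mass of a continuous depth value between its two nearest bins in proportion to proximity, then train against this soft target with cross-entropy. The bin placement, bin count, and exact assignment rule here are illustrative assumptions, not the paper's specification.

```python
import numpy as np

def soft_map_depth(depth, bin_centers):
    """Map a continuous depth value to a soft probability distribution
    over discrete depth bins.

    Mass is split between the two nearest bin centers in proportion to
    proximity (a linear soft assignment); out-of-range depths are clamped
    to the boundary bins. The paper's exact mapping may differ.
    """
    k = len(bin_centers)
    p = np.zeros(k)
    # index of the first bin center >= depth (bin_centers must be sorted)
    j = np.searchsorted(bin_centers, depth)
    if j == 0:
        p[0] = 1.0            # below the first center
    elif j >= k:
        p[-1] = 1.0           # above the last center
    else:
        lo, hi = bin_centers[j - 1], bin_centers[j]
        w = (depth - lo) / (hi - lo)
        p[j - 1] = 1.0 - w    # closer to lo -> more mass on lo's bin
        p[j] = w
    return p

def cross_entropy(p_target, p_pred, eps=1e-12):
    """Cross-entropy between the soft target and a predicted distribution."""
    return -np.sum(p_target * np.log(p_pred + eps))

# Example: 5 uniformly spaced depth bins between 0.5 m and 10 m (assumed)
centers = np.linspace(0.5, 10.0, 5)
target = soft_map_depth(4.0, centers)  # mass shared by the two nearest bins
```

Because the target is a distribution rather than a one-hot label, nearby bins receive partial credit, which softens the quantization error introduced by discretizing depth.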