Song Xibin, Li Wei, Zhou Dingfu, Dai Yuchao, Fang Jin, Li Hongdong, Zhang Liangjun
IEEE Trans Image Process. 2021;30:4691-4705. doi: 10.1109/TIP.2021.3074306. Epub 2021 May 3.
The success of supervised learning-based single image depth estimation methods critically depends on the availability of large-scale dense per-pixel depth annotations, which requires a laborious and expensive annotation process. Therefore, self-supervised methods are highly desirable and have attracted significant attention recently. However, depth maps predicted by existing self-supervised methods tend to be blurry, with many depth details lost. To overcome these limitations, we propose a novel framework, named MLDA-Net, to obtain per-pixel depth maps with sharper boundaries and richer depth details. Our first innovation is a multi-level feature extraction (MLFE) strategy that learns rich hierarchical representations. Then, a dual-attention strategy combining global attention and structure attention is proposed to intensify the obtained features both globally and locally, resulting in improved depth maps with sharper boundaries. Finally, a reweighted loss strategy based on multi-level outputs is proposed to provide effective supervision for self-supervised depth estimation. Experimental results demonstrate that our MLDA-Net framework achieves state-of-the-art depth prediction results on the KITTI benchmark for self-supervised monocular depth estimation across different input and training modes. Extensive experiments on other benchmark datasets further confirm the superiority of our proposed approach.
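The sketch below illustrates the dual-attention idea mentioned in the abstract: a global (channel-wise) attention branch and a structure (spatial) attention branch that re-weight a feature map before it is decoded into depth. The class names, layer sizes, and the sequential way the two branches are combined are illustrative assumptions, not the authors' exact MLDA-Net design.

```python
# Minimal dual-attention sketch (assumed structure, not the paper's exact modules).
import torch
import torch.nn as nn


class GlobalAttention(nn.Module):
    """Channel re-weighting from globally pooled context (SE-style assumption)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                  # global context per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)                         # emphasize informative channels


class StructureAttention(nn.Module):
    """Per-pixel spatial mask intended to sharpen local structure (assumption)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.conv(x)                       # emphasize structured regions


class DualAttention(nn.Module):
    """Apply global attention, then structure attention, to a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.global_att = GlobalAttention(channels)
        self.structure_att = StructureAttention(channels)

    def forward(self, x):
        return self.structure_att(self.global_att(x))


if __name__ == "__main__":
    feats = torch.randn(2, 64, 48, 160)               # e.g. a mid-level encoder feature map
    out = DualAttention(64)(feats)
    print(out.shape)                                   # torch.Size([2, 64, 48, 160])
```

In this reading, the global branch decides which channels matter overall, while the structure branch decides which pixels matter within them; how MLDA-Net actually fuses the two attention outputs with its multi-level features is detailed in the paper itself.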