IEEE Trans Neural Netw Learn Syst. 2021 Nov;32(11):5034-5046. doi: 10.1109/TNNLS.2020.3026669. Epub 2021 Oct 27.
Many computer vision tasks, such as monocular depth estimation and height estimation from a satellite orthophoto, share a common underlying goal: regressing dense continuous values for the pixels of a single image. We define these as dense continuous-value regression (DCR) tasks. Recent approaches based on deep convolutional neural networks significantly improve the performance of DCR tasks, particularly in pixelwise regression accuracy. However, it remains challenging to simultaneously preserve the global structure and fine object details in complex scenes. In this article, we take advantage of the efficiency of the Laplacian pyramid in representing multiscale content to reconstruct high-quality signals for complex scenes. We design a Laplacian pyramid neural network (LAPNet), which consists of a Laplacian pyramid decoder (LPD) for signal reconstruction and an adaptive dense feature fusion (ADFF) module to fuse features from the input image. More specifically, we build an LPD to effectively express both global and local scene structures. In our LPD, the upper and lower levels represent scene layouts and shape details, respectively. We introduce a residual refinement module to progressively complement high-frequency details for signal prediction at each level. To recover the signals at each individual level in the pyramid, an ADFF module is proposed to adaptively fuse multiscale image features for accurate prediction. We conduct comprehensive experiments to evaluate a number of variants of our model on three important DCR tasks, i.e., monocular depth estimation, single-image height estimation, and density map estimation for crowd counting. Experiments demonstrate that our method achieves new state-of-the-art performance in both qualitative and quantitative evaluation on the NYU-D V2 and KITTI datasets for monocular depth estimation, the challenging Urban Semantic 3D (US3D) dataset for satellite height estimation, and four challenging benchmarks for crowd counting.
These results demonstrate that the proposed LAPNet is a universal and effective architecture for DCR problems.
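The core representation behind the LPD is the classical Laplacian pyramid: a signal is decomposed into a coarse low-frequency base plus per-level high-frequency residuals, and is reconstructed coarse-to-fine by upsampling each level and adding back its residual, which mirrors how the decoder progressively complements high-frequency details. The following minimal NumPy sketch illustrates that decomposition and exact reconstruction; the average-pool downsampling and nearest-neighbor upsampling are simplified stand-ins (the paper's learned decoder is not reproduced here), and all function names are illustrative.

```python
import numpy as np

def downsample(x):
    # 2x2 average pooling (a simple stand-in for blur + subsample)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    # Nearest-neighbor upsampling by a factor of 2
    return x.repeat(2, axis=0).repeat(2, axis=1)

def build_laplacian_pyramid(x, levels):
    """Decompose x into per-level high-frequency residuals plus a coarse base."""
    residuals = []
    cur = x
    for _ in range(levels):
        down = downsample(cur)
        # Residual = detail lost by downsampling at this scale
        residuals.append(cur - upsample(down))
        cur = down
    return residuals, cur  # fine-to-coarse residuals, low-frequency base

def reconstruct(residuals, base):
    """Coarse-to-fine reconstruction: upsample and add back each residual."""
    cur = base
    for r in reversed(residuals):
        cur = upsample(cur) + r
    return cur

signal = np.random.rand(8, 8)          # toy dense prediction map
residuals, base = build_laplacian_pyramid(signal, levels=2)
recon = reconstruct(residuals, base)
assert np.allclose(recon, signal)      # the pyramid is exactly invertible
```

In LAPNet, the residual at each level is not stored but predicted by a residual refinement module, with the ADFF module supplying adaptively fused multiscale image features at every pyramid level.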