Sediqi Khwaja Monib, Lee Hyo Jong
Division of Computer Science and Engineering, CAIIT, Jeonbuk National University, Jeonju 54896, Korea.
Sensors (Basel). 2021 Mar 20;21(6):2170. doi: 10.3390/s21062170.
Semantic segmentation, the pixel-wise classification of an image, is a fundamental topic in computer vision owing to its growing importance in robot vision and autonomous driving. It provides rich information about objects in a scene, such as object boundary, category, and location. Recent methods for semantic segmentation often employ an encoder-decoder structure built on deep convolutional neural networks. The encoder extracts image features through a series of filters and pooling operations, whereas the decoder gradually recovers the encoder's low-resolution feature maps into a full-input-resolution feature map for pixel-wise prediction. However, encoder-decoder variants for semantic segmentation suffer from severe spatial information loss, caused by pooling operations or strided convolutions, and do not consider the context in the scene. In this paper, we propose a novel dense upsampling convolution method based on a guided filter to effectively preserve the spatial information of the image in the network. We further propose a novel local context convolution method that not only covers larger-scale objects in the scene but also covers them densely for precise object boundary delineation. Theoretical analyses and experimental results on several benchmark datasets verify the effectiveness of our method. Qualitatively, our approach delineates object boundaries more accurately than current leading methods. Quantitatively, we report new records of 82.86% and 81.62% pixel accuracy on the ADE20K and Pascal-Context benchmark datasets, respectively. Compared with state-of-the-art methods, the proposed method offers promising improvements.
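The abstract does not detail how the proposed dense upsampling convolution incorporates the guided filter; as background only, the sketch below implements the standard edge-preserving guided filter primitive in NumPy that such a method builds on. The window radius `r` and regularizer `eps` are illustrative choices, not the paper's settings, and the naive windowed mean is used for clarity rather than the O(1) box-filter formulation.

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1)x(2r+1) window, edge-padded (naive loop for clarity)."""
    k = 2 * r + 1
    pad = np.pad(x.astype(float), r, mode='edge')
    out = np.empty(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = pad[i:i + k, j:j + k].mean()
    return out

def guided_filter(I, p, r=2, eps=1e-3):
    """Filter p using guide image I: output is locally linear in I,
    so edges present in the guide are preserved in the output."""
    mean_I, mean_p = box_mean(I, r), box_mean(p, r)
    cov_Ip = box_mean(I * p, r) - mean_I * mean_p
    var_I = box_mean(I * I, r) - mean_I * mean_I
    a = cov_Ip / (var_I + eps)   # per-pixel linear coefficient
    b = mean_p - a * mean_I
    # average the coefficients over each window, then apply to the guide
    return box_mean(a, r) * I + box_mean(b, r)

# Example: denoise a noisy step edge while keeping the edge sharp
rng = np.random.default_rng(0)
guide = np.zeros((16, 16))
guide[:, 8:] = 1.0                                       # clean vertical edge
noisy = guide + 0.1 * rng.standard_normal(guide.shape)   # noisy observation
out = guided_filter(guide, noisy, r=2, eps=1e-3)
```

In an upsampling context, the guide would typically be a high-resolution input image and `p` a low-resolution feature map, letting the output inherit sharp object boundaries from the guide.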