IEEE Trans Pattern Anal Mach Intell. 2017 Dec;39(12):2481-2495. doi: 10.1109/TPAMI.2016.2644615. Epub 2017 Jan 2.
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network, and a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and the most efficient inference memory usage among the compared architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet.
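To make the decoder's index-based upsampling concrete, the following is a minimal sketch of the mechanism the abstract describes: the encoder's max-pooling records argmax indices, the decoder scatters values back to those locations (yielding a sparse map), and a trainable convolution densifies the result. The paper's reference implementation is in Caffe; PyTorch is used here purely for illustration, and the channel count and spatial sizes are hypothetical.

```python
# Sketch of SegNet-style upsampling with pooling indices (PyTorch for
# illustration only; the authors' implementation is in Caffe).
import torch
import torch.nn as nn

# Encoder max-pooling that also returns the argmax indices.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# Decoder unpooling places each value back at its remembered index,
# producing a sparse map; no upsampling weights are learned.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
# Trainable decoder filter that convolves the sparse map into a dense one.
densify = nn.Conv2d(64, 64, kernel_size=3, padding=1)

x = torch.randn(1, 64, 32, 32)       # hypothetical encoder feature map
pooled, indices = pool(x)            # 1x64x16x16 plus pooling indices
upsampled = unpool(pooled, indices)  # sparse 1x64x32x32 (non-linear upsampling)
dense = densify(upsampled)           # dense feature map for the next decoder stage
```

Because only the integer indices are stored (rather than entire encoder feature maps, as in architectures with skip connections), this design keeps inference memory low, which is the trade-off the benchmark in the abstract quantifies.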