IEEE Trans Pattern Anal Mach Intell. 2018 Apr;40(4):834-848. doi: 10.1109/TPAMI.2017.2699184. Epub 2017 Apr 27.
In this work we address the task of semantic image segmentation with deep learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks (DCNNs). It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields of view, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but takes a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets a new state of the art on the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7 percent mIOU on the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.
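To make the first two contributions concrete, the following is a minimal one-dimensional sketch in plain Python, not the released DeepLab code. The function names (`atrous_conv1d`, `field_of_view`, `aspp1d`), the toy kernel, and the sampling rates 1 and 2 are assumptions chosen for illustration. Fusing the ASPP branches by summation is likewise an assumption here, since the abstract does not say how the branches are combined. The sketch shows that a 3-tap filter keeps 3 weights at every rate while its field of view grows, and that an ASPP-style module runs several such filters in parallel on the same input.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """1-D atrous (dilated) correlation with zero padding and same-length output.

    y[i] = sum_j signal[i + rate * (j - center)] * kernel[j]
    With rate=1 this is an ordinary correlation; a larger rate spreads the
    same taps further apart without adding any weights.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    n = len(signal)
    center = len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + rate * (j - center)
            if 0 <= idx < n:  # positions outside the signal count as zero
                acc += signal[idx] * w
        out.append(acc)
    return out


def field_of_view(kernel_size, rate):
    """Effective extent of a filter with (rate - 1) gaps between adjacent taps."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


def aspp1d(signal, kernels_by_rate):
    """Toy ASPP: apply one atrous filter per rate in parallel and sum the branches.

    `kernels_by_rate` maps a sampling rate to its own kernel (hypothetical layout).
    """
    fused = [0.0] * len(signal)
    for rate, kernel in kernels_by_rate.items():
        branch = atrous_conv1d(signal, kernel, rate)
        fused = [f + b for f, b in zip(fused, branch)]
    return fused


if __name__ == "__main__":
    impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
    kernel = [1, 2, 3]
    for rate in (1, 2, 4):
        print(rate, field_of_view(len(kernel), rate), atrous_conv1d(impulse, kernel, rate))
    print(aspp1d(impulse, {1: kernel, 2: kernel}))
```

Feeding an impulse through the filter makes the effect visible: the three responses stay the same in value but land further apart as the rate increases, which is the enlarged field of view at a constant parameter count and a constant number of multiplications per output position.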