Context and Spatial Feature Calibration for Real-Time Semantic Segmentation

Authors

Li Kaige, Geng Qichuan, Wan Maoxian, Cao Xiaochun, Zhou Zhong

Publication information

IEEE Trans Image Process. 2023;32:5465-5477. doi: 10.1109/TIP.2023.3318967. Epub 2023 Oct 25.

DOI:10.1109/TIP.2023.3318967

Abstract

Context modeling and multi-level feature fusion have proven effective at improving semantic segmentation performance. However, these methods are not designed to handle pixel-context mismatch or spatial feature misalignment, and their high computational complexity hinders widespread use in real-time scenarios. In this work, we propose a lightweight Context and Spatial Feature Calibration Network (CSFCN) that addresses these issues with pooling-based and sampling-based attention mechanisms. CSFCN contains two core modules: the Context Feature Calibration (CFC) module and the Spatial Feature Calibration (SFC) module. CFC adopts a cascaded pyramid pooling module to efficiently capture nested contexts, then aggregates a private context for each pixel based on pixel-context similarity to realize context feature calibration. SFC splits features into multiple groups of sub-features along the channel dimension and propagates sub-features within these groups through learnable sampling to achieve spatial feature calibration. Extensive experiments on the Cityscapes and CamVid datasets show that our method achieves a state-of-the-art trade-off between speed and accuracy. Concretely, it achieves 78.7% mIoU at 70.0 FPS on the Cityscapes test set and 77.8% mIoU at 179.2 FPS on the CamVid test set. The code is available at https://nave.vr3i.com/ and https://github.com/kaigelee/CSFCN.
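
The context calibration idea in the abstract (pool nested contexts, then give each pixel its own similarity-weighted mix of them) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a flattened 1-D row of pixels, non-overlapping average pooling, dot-product similarity, and a softmax over contexts, and all function names are hypothetical. The published code at the repository above is the reference for the actual CFC design.

```python
import math

def avg_pool_1d(vectors, window):
    """Average consecutive, non-overlapping groups of `window` vectors."""
    pooled = []
    for start in range(0, len(vectors), window):
        group = vectors[start:start + window]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled

def cascaded_contexts(pixels, window=2, levels=2):
    """Pool the previous level's output, so coarser contexts nest finer ones.

    Assumption: this cascade stands in for the cascaded pyramid pooling
    described in the abstract; the real module may differ.
    """
    contexts, current = [], pixels
    for _ in range(levels):
        current = avg_pool_1d(current, window)
        contexts.extend(current)
    return contexts

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def calibrate(pixels, contexts):
    """Build one private context per pixel from pixel-context similarity."""
    private = []
    for p in pixels:
        # Assumption: similarity is a plain dot product.
        scores = [sum(a * b for a, b in zip(p, c)) for c in contexts]
        weights = softmax(scores)
        private.append([
            sum(w * c[d] for w, c in zip(weights, contexts))
            for d in range(len(p))
        ])
    return private

# Four pixels with two channels: two of one "class", two of another.
pixels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
contexts = cascaded_contexts(pixels)   # two fine contexts plus one coarse context
private = calibrate(pixels, contexts)  # one aggregated context vector per pixel
```

In this toy input, each pixel's private context leans toward the pooled region that resembles it rather than toward a single global average, which is the mismatch the abstract says CFC targets.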
