School of Software Engineering, South China University of Technology, Guangzhou 510006, China.
Sensors (Basel). 2021 May 9;21(9):3270. doi: 10.3390/s21093270.
The main challenges of semantic segmentation in vehicle-mounted scenes are object scale variation and the trade-off between model accuracy and efficiency. Lightweight backbone networks for semantic segmentation usually use a fixed receptive field, so they extract only single-scale features layer by layer. Most modern real-time semantic segmentation networks heavily compromise spatial details when encoding semantics, sacrificing accuracy for speed. Many improvement strategies adopt dilated convolution and add a sub-network, which introduces either intensive computation or redundant parameters. We propose a multi-level and multi-scale feature aggregation network (MMFANet). A spatial pyramid module is designed by cascading dilated convolutions with different receptive fields to extract multi-scale features layer by layer. Subsequently, a lightweight backbone network is built by reducing the feature channel capacity of the module. To improve the accuracy of our network, we design two additional modules that separately capture spatial details and high-level semantics from the backbone network without significantly increasing the computation cost. Comprehensive experimental results show that our model achieves 79.3% MIoU on the Cityscapes test dataset at a speed of 58.5 FPS, making it more accurate than SwiftNet (75.5% MIoU). Furthermore, our model has at least 53.38% fewer parameters than other state-of-the-art models.
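To illustrate why cascading dilated convolutions yields multi-scale features, the following minimal sketch computes the receptive field after each layer of a stride-1 cascade. It is not the MMFANet implementation: the function name, the 3x3 kernel size, and the dilation rates (1, 2, 4) are assumptions chosen for illustration, since the abstract does not state the rates used in the spatial pyramid module.

```python
def cascaded_receptive_fields(dilations, kernel_size=3):
    """Receptive field (in input pixels, per axis) after each layer of a
    cascade of stride-1 dilated convolutions."""
    rf = 1  # a single input pixel before any convolution
    fields = []
    for d in dilations:
        # A k-tap kernel with dilation d spans d * (k - 1) + 1 positions,
        # so each layer widens the receptive field by d * (k - 1).
        rf += d * (kernel_size - 1)
        fields.append(rf)
    return fields


# Hypothetical dilation rates; the abstract does not specify them.
rates = [1, 2, 4]
print(cascaded_receptive_fields(rates))  # -> [3, 7, 15]
```

Each intermediate output sees a different extent of the input (3, 7, and 15 pixels here), so features taken from successive layers of the cascade cover several scales, whereas a cascade with a fixed dilation of 1 grows only to 3, 5, and 7 pixels over the same depth.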