Liu Jianbo, He Junjun, Zheng Yuanjie, Yi Shuai, Wang Xiaogang, Li Hongsheng
IEEE Trans Pattern Anal Mach Intell. 2023 Oct;45(10):11390-11406. doi: 10.1109/TPAMI.2021.3114342. Epub 2023 Sep 5.
Both high-level and high-resolution feature representations are of great importance in various visual understanding tasks. To acquire high-resolution feature maps with high-level semantic information, one common strategy is to adopt dilated convolutions in the backbone network, as in dilatedFCN-based methods for semantic segmentation. However, because many convolution operations are conducted on high-resolution feature maps, such methods incur large computational complexity and memory consumption. To balance performance and efficiency, there also exist encoder-decoder structures that gradually recover spatial information by combining multi-level feature maps from a feature encoder, such as the FPN architecture for object detection and U-Net for semantic segmentation. Although more efficient, existing encoder-decoder methods for semantic segmentation perform far worse than dilatedFCN-based methods. In this paper, we propose a novel holistically-guided decoder that obtains high-resolution, semantic-rich feature maps from the encoder's multi-scale features. Decoding is achieved via novel holistic codeword generation and codeword assembly operations, which take advantage of both the high-level and low-level encoder features. With the proposed holistically-guided decoder, we implement the EfficientFCN architecture for semantic segmentation and HGD-FPN for object detection and instance segmentation. EfficientFCN achieves comparable or even better performance than state-of-the-art methods with only 1/3 of their computational cost for semantic segmentation on the PASCAL Context, PASCAL VOC, and ADE20K datasets. Meanwhile, the proposed HGD-FPN achieves higher mean Average Precision (mAP) when integrated into several object detection frameworks with ResNet-50 encoder backbones.
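The codeword generation and assembly operations described above can be viewed as two matrix products: low-resolution, high-level features are pooled into a small set of holistic codewords via learned spatial attention, and each high-resolution location then linearly recombines those codewords using coefficients predicted from the low-level features. The following NumPy sketch illustrates this data flow under our own assumptions; the function and parameter names (`holistic_decode`, `W_attn`, `W_base`, `W_asm`) are hypothetical and do not reflect the paper's exact implementation, which uses learned convolutions and multi-scale feature fusion.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def holistic_decode(high_feat, low_feat, W_attn, W_base, W_asm):
    """Illustrative holistically-guided decoding (names/shapes assumed).

    high_feat: (C, h*w)  flattened low-resolution, high-level features
    low_feat:  (C, H*W)  flattened high-resolution, low-level features
    W_attn:    (n, C)    projects features to n spatial attention maps
    W_base:    (d, C)    projects features to a d-dim basis for codewords
    W_asm:     (n, C)    predicts per-pixel codeword assembly coefficients
    Returns:   (d, H*W)  high-resolution, semantic-rich feature map
    """
    # Holistic codeword generation: attention-weighted pooling over all
    # low-resolution positions yields n global codewords of dimension d.
    attn = softmax(W_attn @ high_feat, axis=1)      # (n, h*w)
    codewords = (W_base @ high_feat) @ attn.T       # (d, n)

    # Codeword assembly: each high-resolution pixel predicts coefficients
    # from low-level features and mixes the codewords accordingly.
    coeff = W_asm @ low_feat                        # (n, H*W)
    return codewords @ coeff                        # (d, H*W)
```

Because the expensive global reasoning happens only at the coarse resolution (the codewords), the high-resolution output is produced by a single cheap linear combination per pixel, which is the source of the efficiency gain over dilatedFCN-style decoding.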