Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616, USA.
Sensors (Basel). 2023 Jan 4;23(2):581. doi: 10.3390/s23020581.
Transformer-based semantic segmentation methods have achieved excellent performance in recent years. Mask2Former is a well-known transformer-based method that unifies common image segmentation tasks into a universal model. However, because it relies heavily on transformers, it performs relatively poorly at capturing local features and segmenting small objects. To this end, we propose a simple yet effective architecture that introduces auxiliary branches to Mask2Former during training to capture dense local features on the encoder side. The obtained features help improve the learning of local information and the segmentation of small objects. Since the proposed auxiliary convolution layers are required only for training and can be removed during inference, the performance gain incurs no additional computation at inference. Experimental results show that our model achieves state-of-the-art performance of 57.6% mIoU on ADE20K and 84.8% mIoU on Cityscapes.
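The core idea above — a training-only auxiliary convolutional branch that is dropped at inference — can be sketched as follows. This is an illustrative PyTorch pattern, not the paper's actual architecture: the backbone, channel counts, and `aux_head` name are assumptions for the example.

```python
import torch
import torch.nn as nn

class EncoderWithAuxBranch(nn.Module):
    """Hypothetical sketch: an encoder with an auxiliary convolutional head
    that produces dense local predictions during training only. At inference
    the branch is skipped, so it adds no inference-time computation."""

    def __init__(self, channels=64, num_classes=19):
        super().__init__()
        # Stand-in for the real encoder (assumption, not Mask2Former's backbone).
        self.backbone = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # Auxiliary branch: supervised by an extra per-pixel loss during training.
        self.aux_head = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        feats = self.backbone(x)
        if self.training:
            # Training: return both the features and the auxiliary prediction,
            # so an auxiliary loss can push the encoder toward dense local features.
            return feats, self.aux_head(feats)
        # Inference: the auxiliary branch is never executed.
        return feats

model = EncoderWithAuxBranch()
x = torch.randn(1, 3, 32, 32)

model.train()
feats, aux = model(x)          # features plus auxiliary dense prediction

model.eval()
feats_only = model(x)          # auxiliary branch adds no inference cost
```

Because the auxiliary head's parameters are touched only inside the `self.training` branch, they can be deleted from the checkpoint after training without affecting inference behavior.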