

Self-Enhanced Mixed Attention Network for Three-Modal Images Few-Shot Semantic Segmentation.

Authors

Song Kechen, Zhang Yiming, Bao Yanqi, Zhao Ying, Yan Yunhui

Affiliations

School of Mechanical Engineering & Automation, Northeastern University, Shenyang 110819, China.

National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China.

Publication

Sensors (Basel). 2023 Jul 22;23(14):6612. doi: 10.3390/s23146612.

Abstract

As an important computer vision technique, image segmentation has been widely used in various tasks. However, in some extreme cases, insufficient illumination severely degrades model performance, so an increasing number of fully supervised methods take multi-modal images as input. Densely annotated large-scale datasets are difficult to obtain, whereas few-shot methods can still achieve satisfactory results with only a few pixel-annotated samples. We therefore propose a Visible-Depth-Thermal (three-modal) few-shot semantic segmentation method. It exploits the homogeneous information within the three-modal images and the complementary information across modalities, which improves the performance of few-shot segmentation tasks. We constructed a novel indoor dataset, VDT-2048-5, for the three-modal few-shot semantic segmentation task, and we propose a Self-Enhanced Mixed Attention Network (SEMANet), which consists of a Self-Enhanced (SE) module and a Mixed Attention (MA) module. The SE module amplifies the differences between different kinds of features and strengthens the weak connections among foreground features. The MA module fuses the three-modal features to obtain a better joint representation. Compared with the previous best methods, our model improves mIoU by 3.8% and 3.3% in the 1-shot and 5-shot settings, respectively, achieving state-of-the-art performance. In future work, we will address failure cases by learning more discriminative and robust feature representations, and explore achieving high performance with fewer parameters and lower computational cost.
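The mIoU gains cited above refer to the standard mean intersection-over-union metric for semantic segmentation: per-class IoU averaged over classes. The function below is a generic minimal sketch of that metric on label maps, not the paper's evaluation code.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean IoU over classes that appear in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x3 label maps with classes {0, 1}: class 0 IoU = 2/3, class 1 IoU = 3/4
pred = np.array([[0, 1, 1],
                 [0, 0, 1]])
gt   = np.array([[0, 1, 1],
                 [0, 1, 1]])
print(mean_iou(pred, gt, num_classes=2))  # (2/3 + 3/4) / 2 ≈ 0.7083
```

In few-shot benchmarks such as the VDT-2048-5 setting described here, this per-class IoU is typically averaged over the episode's target classes.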


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9b24/10386587/24fb4d533acd/sensors-23-06612-g001.jpg
