Xu Mengde, Zhang Zheng, Wei Fangyun, Hu Han, Bai Xiang
IEEE Trans Pattern Anal Mach Intell. 2023 Dec;45(12):15546-15561. doi: 10.1109/TPAMI.2023.3311618. Epub 2023 Nov 3.
This article concentrates on open-vocabulary semantic segmentation, where a well-optimized model can segment arbitrary categories that appear in an image. To achieve this goal, we present a novel framework termed Side Adapter Network, or SAN for short. Our design principles are three-fold: 1) Recent large-scale vision-language models (e.g., CLIP) show promising open-vocabulary image classification capability; adapting a pre-trained CLIP model to open-vocabulary semantic segmentation is therefore economical in training cost. 2) Our SAN model should be both lightweight and effective in order to reduce the inference cost; to achieve this, we fuse the CLIP model's intermediate features to enhance the representation capability of the SAN model, and drive the CLIP model to focus on the informative areas of an image with the aid of attention biases predicted by the side adapter network. 3) Our approach should empower mainstream segmentation architectures with the capability of open-vocabulary segmentation; we present P-SAN and R-SAN to support the widely adopted pixel-wise and region-wise segmentation paradigms, respectively. Experimentally, our approach achieves state-of-the-art performance on 5 commonly used benchmarks while having far fewer trainable parameters and GFLOPs. For instance, our R-SAN outperforms the previous best method, OvSeg, by +2.3 averaged mIoU across all benchmarks while using only 6% of the trainable parameters and less than 1% of the GFLOPs. In addition, we conduct a comprehensive analysis of the open-vocabulary semantic segmentation datasets and verify the feasibility of transferring a well-optimized R-SAN model to the video segmentation task.
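The attention-bias idea in design principle 2, adding predicted biases to attention logits so the frozen CLIP model attends to informative regions, can be sketched as follows. This is a minimal single-head, pure-Python illustration under our own assumptions (function names, shapes, and the toy inputs are hypothetical), not the paper's actual implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_with_bias(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    In SAN-style designs, `bias[i][j]` would be predicted by a side
    network and added to the frozen model's attention scores, steering
    attention toward informative regions without retraining it.
    q: list of query vectors; k, v: lists of key/value vectors.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        # logit_ij = (q_i . k_j) / sqrt(d) + bias[i][j]
        logits = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) + bias[i][j]
                  for j, kj in enumerate(k)]
        w = softmax(logits)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(wj * v[j][c] for j, wj in enumerate(w))
                    for c in range(len(v[0]))])
    return out
```

For example, a large positive bias toward one key makes the output collapse onto that key's value, which is the mechanism by which a lightweight side network can redirect a frozen backbone's attention.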