Guo Shi-Cheng, Liu Shang-Kun, Wang Jing-Yu, Zheng Wei-Min, Jiang Cheng-Yu
College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China.
Entropy (Basel). 2023 Sep 18;25(9):1353. doi: 10.3390/e25091353.
Recent research has shown that vision-language pretrained models perform well on traditional vision tasks. CLIP, as the most influential such work, has garnered significant attention from researchers. Thanks to its excellent visual representation capabilities, many recent studies have applied CLIP to pixel-level tasks. We explore the potential of CLIP in the field of few-shot segmentation. The current mainstream approach is to use support and query features to generate class prototypes and then match the prototype features against image features. We propose a new method that uses CLIP to extract text features for a specific class; these text features then serve as additional training samples in the model's training process. The added text features enable the model to extract features containing richer semantic information, making it easier to capture latent class information. To better match the query image features, we also propose a new prototype generation method that incorporates multi-modal fused features of text and images into the prototype generation process. Adaptive query prototypes are generated by combining foreground and background information from the images with the multi-modal support prototype, allowing better matching of image features and improving segmentation accuracy. We provide a new perspective on the task of few-shot segmentation in multi-modal scenarios. Experiments demonstrate that our proposed method achieves excellent results on two common datasets, PASCAL-5i and COCO-20i.
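To make the first step concrete, the snippet below is a minimal sketch of extracting class-specific text features with the openai/CLIP package, as the abstract describes. The "a photo of a {class}" prompt template, the ViT-B/32 backbone, and the example class names are assumptions for illustration; the abstract does not specify the paper's actual prompts or CLIP variant.

```python
import torch
import clip  # openai/CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # backbone choice assumed

# Example class names (e.g., from a PASCAL-5i fold); the real prompt design
# used by the paper is not given in this abstract.
class_names = ["aeroplane", "bicycle", "bird"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

with torch.no_grad():
    # Encode and L2-normalize one text feature per class; for ViT-B/32
    # the embedding dimension is 512.
    text_features = model.encode_text(prompts)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
```

These per-class embeddings are what would then participate in prototype generation alongside the visual features.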
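The abstract outlines two prototype-related steps: fusing the CLIP text feature with a masked-pooled support feature into a multi-modal support prototype, and deriving adaptive foreground/background query prototypes from that prototype. The following is one minimal PyTorch reading of those steps; the average fusion, the linear text-to-visual projection, the cosine-similarity matching, and the threshold tau are all assumptions, since the abstract does not specify these operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_average_pool(feat, mask):
    """Average-pool a (B, C, H, W) feature map over a (B, 1, H, W) float mask."""
    mask = F.interpolate(mask, size=feat.shape[-2:], mode="nearest")
    return (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)

class MultiModalPrototype(nn.Module):
    """Fuse a CLIP text embedding with the support image prototype.

    The linear projection and average fusion are assumed stand-ins for
    whatever fusion module the paper actually uses.
    """

    def __init__(self, text_dim=512, visual_dim=256):
        super().__init__()
        self.proj = nn.Linear(text_dim, visual_dim)  # map text to visual space

    def forward(self, support_feat, support_mask, text_embed):
        visual_proto = masked_average_pool(support_feat, support_mask)  # (B, C)
        text_proto = self.proj(text_embed)                              # (B, C)
        return 0.5 * (visual_proto + text_proto)  # simple average fusion (assumed)

def adaptive_query_prototypes(query_feat, support_proto, tau=0.5):
    """Coarsely match the support prototype against the query feature map,
    then pool foreground and background query prototypes from the estimate."""
    B, C, _, _ = query_feat.shape
    q = F.normalize(query_feat, dim=1)
    p = F.normalize(support_proto, dim=1).view(B, C, 1, 1)
    sim = (q * p).sum(dim=1, keepdim=True)  # cosine similarity, (B, 1, H, W)
    fg = (sim > tau).float()                # coarse foreground estimate
    fg_proto = masked_average_pool(query_feat, fg)
    bg_proto = masked_average_pool(query_feat, 1.0 - fg)
    return fg_proto, bg_proto
```

At inference, each query location would be compared against the foreground and background prototypes to produce the segmentation, as in standard prototype-matching pipelines.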