Li Jingyao, Chen Pengguang, Qian Shengju, Liu Shu, Jia Jiaya
IEEE Trans Pattern Anal Mach Intell. 2024 Dec;46(12):11287-11297. doi: 10.1109/TPAMI.2024.3454647. Epub 2024 Nov 6.
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks. However, existing approaches that use CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes, causing confusion between novel classes and semantically similar known ones. In this work, we propose a novel approach, TagCLIP (Trusty-aware guided CLIP), to address this issue. We disentangle the ill-posed optimization problem into two parallel processes: semantic matching, performed individually per class, and reliability judgment, which improves discrimination ability. Building on the idea of special tokens in language modeling that carry sentence-level embeddings, we introduce a trusty token that distinguishes novel classes from known ones at prediction time. To evaluate our approach, we conduct experiments on two benchmark datasets, PASCAL VOC 2012 and COCO-Stuff 164K. Our results show that TagCLIP improves the Intersection over Union (IoU) of unseen classes by 7.4% and 1.7%, respectively, with negligible overhead. Code is available here.
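The two disentangled processes described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the shape conventions, and the idea of using a scalar per-patch `trusty_score` (standing in for the trusty token's output) to rescale unseen-class probabilities are all assumptions made for illustration.

```python
import numpy as np

def predict_patch_labels(patch_emb, text_emb, trusty_score, unseen_idx, tau=0.07):
    """Hypothetical sketch of semantic matching + reliability judgment.

    patch_emb    : (P, D) L2-normalized CLIP patch embeddings
    text_emb     : (C, D) L2-normalized class-name text embeddings
    trusty_score : (P,)   assumed per-patch confidence in [0, 1] that the
                          patch belongs to an unseen class (trusty-token proxy)
    unseen_idx   : indices of unseen classes among the C classes
    """
    # Semantic matching: temperature-scaled cosine similarity, per class.
    logits = patch_emb @ text_emb.T / tau                   # (P, C)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)               # softmax over classes
    # Reliability judgment: down-weight unseen-class probabilities where the
    # trusty score is low, so unreliable patches fall back to seen classes.
    probs[:, unseen_idx] *= np.asarray(trusty_score)[:, None]
    probs /= probs.sum(axis=1, keepdims=True)               # renormalize
    return probs.argmax(axis=1)                             # (P,) class labels
```

In this sketch, a trusty score of 0 fully suppresses unseen classes for a patch, while a score of 1 leaves the plain CLIP patch-text matching untouched; the actual TagCLIP formulation should be taken from the paper and released code.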