Kato Naoki, Nota Yoshiki, Aoki Yoshimitsu
Department of Electrical Engineering, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Kanagawa, Japan.
Meidensha Corporation, Tokyo 141-6029, Japan.
Sensors (Basel). 2024 Jun 4;24(11):3624. doi: 10.3390/s24113624.
Large vision-language models, such as Contrastive Language-Image Pre-training (CLIP), pre-trained on large-scale image-text datasets, have demonstrated robust zero-shot transfer capabilities across various downstream tasks. To further enhance the few-shot recognition performance of CLIP, Tip-Adapter augments the CLIP model with an adapter that incorporates a key-value cache model constructed from the few-shot training set. This approach enables training-free adaptation and has shown significant improvements in few-shot recognition, especially with additional fine-tuning. However, the size of the adapter grows in proportion to the number of training samples, making it difficult to deploy in practical applications. In this paper, we propose a novel CLIP adaptation method, named Proto-Adapter, which employs a single-layer adapter of constant size regardless of the amount of training data and even outperforms Tip-Adapter. Proto-Adapter constructs the adapter's weights from prototype representations of each class. By aggregating the features of the training samples, it reduces the size of the adapter without compromising performance. Moreover, the performance of the model can be further enhanced by fine-tuning the adapter's weights with a distance margin penalty, which imposes additional inter-class discrepancy on the output logits. We posit that this training scheme yields a model with a discriminative decision boundary even when trained on a limited amount of data. We demonstrate the effectiveness of the proposed method through extensive few-shot classification experiments on diverse datasets.
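The core idea, a constant-size adapter built from per-class prototypes whose logits are blended with CLIP's zero-shot scores, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the `alpha` and `beta` hyper-parameters and the exponential affinity follow the Tip-Adapter convention, and all function names are illustrative.

```python
import numpy as np

def build_proto_adapter(features, labels, n_classes):
    """Build a constant-size adapter from class prototypes.

    features: (N, D) L2-normalized few-shot image features
    labels:   (N,) integer class labels
    Returns an (n_classes, D) weight matrix, one prototype per class,
    so the adapter size is independent of the number of shots.
    """
    W = np.zeros((n_classes, features.shape[1]))
    for c in range(n_classes):
        proto = features[labels == c].mean(axis=0)  # aggregate shots
        W[c] = proto / np.linalg.norm(proto)        # re-normalize the mean
    return W

def adapted_logits(x, proto_W, clip_text_W, alpha=1.0, beta=5.0):
    """Blend zero-shot CLIP logits with prototype-affinity logits.

    x: (D,) normalized test image feature
    clip_text_W: (n_classes, D) normalized CLIP text embeddings
    alpha, beta: assumed blending/sharpness hyper-parameters
    """
    zero_shot = x @ clip_text_W.T                    # CLIP zero-shot scores
    affinity = np.exp(-beta * (1.0 - x @ proto_W.T)) # prototype affinity
    return zero_shot + alpha * affinity
```

Because `proto_W` has one row per class rather than one per training sample (as in Tip-Adapter's key-value cache), the adapter's footprint stays fixed as the number of shots grows; fine-tuning would then update `proto_W` under the distance margin penalty described above.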