Lu Yehao, Cai Chaoxiang, Su Wei, Zheng Guangcong, Wang Wenjie, Li Xuewei, Li Xi
IEEE Trans Image Process. 2025;34:5215-5227. doi: 10.1109/TIP.2025.3573498.
Few-shot fine-tuning of pre-trained vision-language models (VLMs) for downstream tasks has gained widespread attention for reducing data annotation effort while maintaining high performance. However, we observe that VLMs excel at excluding most incorrect classes in fine-grained recognition tasks but struggle with a small set of confusing categories, which are typically highly similar subspecies. Existing few-shot fine-tuning methods attempt to recognize the correct category directly among all predefined classes, which limits their ability to capture discriminative features for those confusing categories. This raises an intriguing question: can we specifically extract useful information from confusing classes to enhance fine-grained recognition performance? Based on this insight, we propose a hierarchical few-shot fine-tuning framework, the Attribute-Decoupled Discriminator (AttrDD), which addresses the severe confusion problem while ensuring interpretability. Instead of reasoning once over all classes, AttrDD employs two-stage recognition: "think through," then "think smart." Specifically, in the first stage, a representative VLM, CLIP, is fine-tuned to select the Top-K confusing classes. In the second stage, we leverage the knowledge of large language models (LLMs) to generate fixed-format descriptions of the attribute differences between these confusing classes via in-context learning. Attribute-decoupled classification is then conducted to capture fine-grained discriminative features. To achieve parameter-efficient fine-tuning, we introduce a lightweight attention adapter in each stage to align image features with task-specific textual features and LLM-generated textual features, respectively. Extensive experiments on 9 fine-grained recognition benchmarks demonstrate that AttrDD consistently outperforms existing baselines by wide margins.
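The abstract describes a two-stage pipeline: coarse Top-K confusing-class selection with a fine-tuned CLIP, followed by attribute-decoupled re-scoring against LLM-generated attribute descriptions, with a lightweight attention adapter in each stage. The minimal PyTorch sketch below illustrates that control flow only; AttentionAdapter, two_stage_predict, and all tensor shapes are hypothetical illustrations chosen for self-containment, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAdapter(nn.Module):
    """Lightweight adapter: cross-attend an image feature over a set of text
    features and add the result back as a residual (illustrative only)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_feat: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # image_feat: (B, 1, D); text_feats: (B, N, D)
        adapted, _ = self.attn(image_feat, text_feats, text_feats)
        return F.normalize(image_feat + adapted, dim=-1)


def two_stage_predict(image_feat, class_text_feats, attr_text_feats,
                      stage1_adapter, stage2_adapter, k=5):
    """Hypothetical two-stage inference.
    Stage 1 ("think through"): score all classes, keep the Top-K confusing ones.
    Stage 2 ("think smart"): re-score only those K classes against per-attribute
    text features (e.g., LLM-generated descriptions) and aggregate."""
    B = image_feat.size(0)

    # ----- Stage 1: coarse scoring over all predefined classes -----
    all_cls = class_text_feats.unsqueeze(0).expand(B, -1, -1)        # (B, C, D)
    feat1 = stage1_adapter(image_feat.unsqueeze(1), all_cls).squeeze(1)
    logits_all = feat1 @ class_text_feats.T                          # (B, C)
    _, topk_idx = logits_all.topk(k, dim=-1)                         # Top-K confusing classes

    # ----- Stage 2: attribute-decoupled scoring among the Top-K -----
    # attr_text_feats: (C, A, D) -- one embedding per class per attribute,
    # e.g., "beak shape: ...", "wing pattern: ..." produced by an LLM.
    cand = attr_text_feats[topk_idx]                                 # (B, K, A, D)
    Bc, K, A, D = cand.shape
    feat2 = stage2_adapter(image_feat.unsqueeze(1),
                           cand.reshape(Bc, K * A, D)).squeeze(1)    # (B, D)
    attr_logits = torch.einsum('bd,bkad->bka', feat2, cand)          # (B, K, A)
    final_logits = attr_logits.mean(dim=-1)                          # aggregate over attributes
    best = final_logits.argmax(dim=-1, keepdim=True)                 # index within the Top-K set
    return topk_idx.gather(1, best).squeeze(1)                       # predicted class ids
```

In this sketch both stages take plain feature tensors so the example stays runnable; in the described framework, the stage-1 text features would come from task-specific prompts encoded by CLIP and the stage-2 features from the fixed-format attribute descriptions generated via in-context learning.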