Chen Yuzhong, Xiao Zhenxiang, Pan Yi, Zhao Lin, Dai Haixing, Wu Zihao, Li Changhe, Zhang Tuo, Li Changying, Zhu Dajiang, Liu Tianming, Jiang Xi
IEEE Trans Neural Netw Learn Syst. 2025 May;36(5):9636-9647. doi: 10.1109/TNNLS.2024.3418527. Epub 2025 May 2.
Learning with little data is challenging but often inevitable in application scenarios where labeled data are limited and costly. Recently, few-shot learning (FSL) has gained increasing attention because it generalizes prior knowledge to new tasks that contain only a few samples. However, for data-intensive models such as the vision transformer (ViT), current fine-tuning-based FSL approaches generalize knowledge inefficiently and thus degrade downstream task performance. In this article, we propose a novel mask-guided ViT (MG-ViT) to achieve effective and efficient FSL on the ViT model. The key idea is to apply a mask to image patches to screen out the task-irrelevant ones and to guide the ViT to focus on task-relevant and discriminative patches during FSL. In particular, MG-ViT introduces only an additional mask operation and a residual connection, so it inherits the parameters of a pretrained ViT at no additional cost. To optimally select representative few-shot samples, we also include an active learning-based sample selection method to further improve the generalizability of MG-ViT-based FSL. We evaluate the proposed MG-ViT on classification, object detection, and segmentation tasks, using gradient-weighted class activation mapping (Grad-CAM) to generate the masks. The experimental results show that the MG-ViT model significantly improves performance and efficiency compared with general fine-tuning-based ViT and ResNet models, providing novel insights and a concrete approach toward generalizing data-intensive and large-scale deep learning models for FSL.
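The core mechanism the abstract describes can be sketched in a few lines: score each image patch with a Grad-CAM-style importance map, build a binary mask that keeps only the top-scoring (task-relevant) patches, and combine the masked patch embeddings with the originals through a residual connection so that pretrained ViT weights are reused unchanged. The following is a minimal sketch, not the paper's exact formulation; the function names, the `keep_ratio` parameter, and the simple top-k masking rule are illustrative assumptions.

```python
import numpy as np

def grad_cam_mask(importance, keep_ratio=0.5):
    """Build a binary per-patch mask from a hypothetical Grad-CAM
    importance vector, keeping the top `keep_ratio` fraction of patches."""
    k = max(1, int(len(importance) * keep_ratio))
    top_idx = np.argsort(importance)[::-1][:k]  # indices of most relevant patches
    mask = np.zeros_like(importance, dtype=float)
    mask[top_idx] = 1.0
    return mask

def mask_guided_patches(patch_embeddings, mask):
    """Apply the mask with a residual connection:
    x + mask * x, so masked-out patches pass through unchanged while
    task-relevant patches are emphasized; the pretrained ViT backbone
    then processes the result without any parameter changes."""
    masked = patch_embeddings * mask[:, None]  # zero out irrelevant patches
    return patch_embeddings + masked           # residual connection

# Illustrative usage: 4 patches with 2-dim embeddings.
scores = np.array([0.9, 0.1, 0.5, 0.3])       # stand-in Grad-CAM scores
mask = grad_cam_mask(scores, keep_ratio=0.5)  # keeps patches 0 and 2
out = mask_guided_patches(np.ones((4, 2)), mask)
```

In this sketch the residual connection is what makes the operation cheap: because the unmasked embeddings always flow through, the module can be dropped into a pretrained ViT without retraining it from scratch, which matches the abstract's claim of inheriting parameters at no additional cost.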