Zhou Shichao, Li Haoyan, Wang Zhuowei, Zhang Zekai
Key Laboratory of Information and Communication Systems, Ministry of Information Industry, Beijing Information Science and Technology University, Beijing, China.
Front Neurosci. 2024 Feb 12;18:1349204. doi: 10.3389/fnins.2024.1349204. eCollection 2024.
State-of-the-art image object detection computational models require an intensive parameter fine-tuning stage (e.g., using deep convolutional networks) with tens or hundreds of training examples. In contrast, human intelligence can robustly learn a new concept from just a few instances (i.e., few-shot detection). The distinctive perception mechanisms between these two families of systems enlighten us to revisit classical handcrafted local descriptors (e.g., SIFT, HOG) as well as non-parametric visual models, which innately require no learning/training phase. Herein, we claim that the inferior performance of these local descriptors mainly results from a lack of global structure sense. To address this issue, we refine local descriptors with spatial contextual attention over neighbor affinities and then embed the local descriptors into a discriminative subspace guided by a Kernel-InfoNCE loss. Differing from conventional quantization of local descriptors in high-dimensional feature space or isometric dimension reduction, we actually seek a brain-inspired few-shot feature representation for the object manifold, which combines data-independent primitive representation with semantic context learning and thus helps with generalization. The obtained embeddings, as pattern vectors/tensors, permit an accelerated but non-parametric visual similarity computation as the decision rule for final detection. Our approach to few-shot object detection is nearly learning-free, and experiments on remote sensing imagery (an approximately 2-D affine space) confirm the efficacy of our model.
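The contrastive objective named above can be illustrated with a minimal sketch, assuming the Kernel-InfoNCE loss replaces the usual cosine similarity in InfoNCE with a kernel (an RBF kernel is used here for concreteness; the function names, hyperparameters, and descriptor dimensionality are illustrative, not taken from the paper):

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # RBF kernel similarity between two descriptor vectors
    # (assumption: the kernel stands in for cosine similarity in InfoNCE).
    return np.exp(-gamma * np.sum((x - y) ** 2))

def kernel_infonce_loss(query, positive, negatives, tau=0.1, gamma=1.0):
    # InfoNCE contrastive loss with a kernel similarity:
    #   -log( exp(k(q,p)/tau) / (exp(k(q,p)/tau) + sum_i exp(k(q,n_i)/tau)) )
    # Minimizing this pulls matching descriptor pairs together in the
    # embedding space and pushes non-matching ones apart.
    pos = np.exp(rbf_kernel(query, positive, gamma) / tau)
    neg = sum(np.exp(rbf_kernel(query, n, gamma) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)
q = rng.normal(size=8)
p = q + 0.01 * rng.normal(size=8)            # near-duplicate: a positive pair
ns = [rng.normal(size=8) for _ in range(5)]  # unrelated descriptors: negatives

loss_good = kernel_infonce_loss(q, p, ns)            # correct positive
loss_bad = kernel_infonce_loss(q, ns[0], [p] + ns[1:])  # mismatched positive
print(loss_good < loss_bad)  # a true matching pair yields a lower loss
```

In this toy setup the loss is low when the designated positive really is the query's near-duplicate and high when an unrelated descriptor is labeled positive, which is the behavior a discriminative-subspace objective needs.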