IEEE Trans Image Process. 2021;30:3691-3704. doi: 10.1109/TIP.2021.3064256. Epub 2021 Mar 17.
This article presents a novel keypoints-based attention mechanism for visual recognition in still images. Deep Convolutional Neural Networks (CNNs) have shown great success at recognizing images from visually distinctive classes, but their performance at discriminating fine-grained changes is not at the same level. We address this by proposing an end-to-end CNN model that learns meaningful features capturing fine-grained changes using our novel attention mechanism. It captures the spatial structure of an image by identifying semantic regions (SRs) and their spatial distribution, which is shown to be key to modeling subtle changes in images. We identify these SRs automatically by grouping the keypoints detected in a given image. The "usefulness" of each SR for image recognition is measured by our attention mechanism, which focuses on the parts of the image that are most relevant to a given task. This framework applies to both traditional and fine-grained image recognition tasks and does not require manually annotated regions (e.g. bounding boxes of body parts, objects, etc.) for learning or prediction. Moreover, the proposed keypoints-driven attention mechanism can be easily integrated into existing CNN models. The framework is evaluated on six diverse benchmark datasets. The model outperforms state-of-the-art approaches by a considerable margin on the Distracted Driver V1 (Acc: 3.39%), Distracted Driver V2 (Acc: 6.58%), Stanford-40 Actions (mAP: 2.15%), People Playing Musical Instruments (mAP: 16.05%), Food-101 (Acc: 6.30%) and Caltech-256 (Acc: 2.59%) datasets.
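The abstract describes the pipeline only at a high level: detect keypoints, group them into semantic regions, pool features over each region, and weight the regions with a learned attention score before classification. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the authors' implementation: the keypoint detector is replaced by random points, the backbone is a toy two-layer CNN, and the names `group_keypoints`, `pool_region`, and `RegionAttention` are assumptions introduced here for clarity.

```python
# Minimal sketch (not the authors' code) of keypoints-grouped region attention.
# Assumptions: keypoints come from any off-the-shelf detector (random here),
# regions are formed by simple k-means over keypoint coordinates, and region
# features are average-pooled from the CNN feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F


def group_keypoints(keypoints, num_regions=4, iters=10):
    """Cluster (x, y) keypoints in [0, 1]^2 into `num_regions` groups (plain k-means)."""
    centers = keypoints[torch.randperm(keypoints.size(0))[:num_regions]].clone()
    for _ in range(iters):
        assign = torch.cdist(keypoints, centers).argmin(dim=1)   # (K,)
        for r in range(num_regions):
            members = keypoints[assign == r]
            if members.numel() > 0:
                centers[r] = members.mean(dim=0)
    return centers, assign


def pool_region(feat_map, center, radius=0.15):
    """Average-pool the feature map inside a square window around a region center."""
    _, h, w = feat_map.shape
    cx, cy = int(center[0] * (w - 1)), int(center[1] * (h - 1))
    r = max(1, int(radius * min(h, w)))
    x0, x1 = max(0, cx - r), min(w, cx + r + 1)
    y0, y1 = max(0, cy - r), min(h, cy + r + 1)
    return feat_map[:, y0:y1, x0:x1].mean(dim=(1, 2))            # (C,)


class RegionAttention(nn.Module):
    """Scores pooled region features and returns their attention-weighted summary."""

    def __init__(self, channels):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(channels, channels // 2),
                                   nn.ReLU(),
                                   nn.Linear(channels // 2, 1))

    def forward(self, region_feats):                              # (R, C)
        weights = F.softmax(self.score(region_feats), dim=0)      # (R, 1)
        return (weights * region_feats).sum(dim=0)                # (C,)


if __name__ == "__main__":
    torch.manual_seed(0)
    backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                             nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
    attention = RegionAttention(channels=128)
    classifier = nn.Linear(128 * 2, 10)          # attended regions + global feature

    image = torch.randn(1, 3, 224, 224)
    keypoints = torch.rand(50, 2)                # stand-in for a keypoint detector

    feat = backbone(image)[0]                    # (128, 56, 56)
    centers, _ = group_keypoints(keypoints, num_regions=4)
    region_feats = torch.stack([pool_region(feat, c) for c in centers])  # (4, 128)

    attended = attention(region_feats)           # (128,) attention-weighted SR summary
    global_feat = feat.mean(dim=(1, 2))          # (128,) global average pooling
    logits = classifier(torch.cat([attended, global_feat]))
    print(logits.shape)                          # torch.Size([10])
```

Because the attention scores are produced by a small learned scorer over pooled region features, this style of module can be bolted onto an existing CNN backbone with little change, which is consistent with the abstract's claim that the mechanism integrates easily into existing models.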