IEEE Trans Cybern. 2022 Jul;52(7):5682-5694. doi: 10.1109/TCYB.2020.2981480. Epub 2022 Jul 4.
Accurately classifying sceneries with different spatial configurations is an indispensable technique in computer vision and intelligent systems, for example, scene parsing, robot motion planning, and autonomous driving. Remarkable performance has been achieved by deep recognition models in the past decade. To our knowledge, however, these deep architectures are incapable of explicitly encoding human visual perception, that is, the sequence of gaze movements and the subsequent cognitive processes. In this article, a biologically inspired deep model is proposed for scene classification, where human gaze behaviors are robustly discovered and represented by a unified deep active learning (UDAL) framework. More specifically, to characterize objects' components with varied sizes, an objectness measure is employed to decompose each scenery into a set of semantically aware object patches. To represent each region at a low level, a local-global feature fusion scheme is developed that optimally integrates multimodal features by automatically calculating each feature's weight. To mimic the human visual perception of various sceneries, we develop UDAL, which hierarchically represents human gaze behavior by recognizing semantically important regions within the scenery. Importantly, UDAL combines semantically salient region detection and deep gaze shifting path (GSP) representation learning into a principled framework that requires only partial semantic tags. Meanwhile, by incorporating a sparsity penalty, contaminated/redundant low-level regional features can be intelligently avoided. Finally, the deep GSP features learned from the entire scene images are integrated to form an image kernel, which is subsequently fed into a kernel SVM to classify different sceneries. Experimental evaluations on six well-known scenery sets (including remote sensing images) have shown the competitiveness of our approach.
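The final stage described above (integrating deep GSP features into an image kernel and feeding it to a kernel SVM) can be illustrated with a minimal sketch. The construction below is an assumption for illustration only: it mean-pools each image's GSP features and uses an RBF image kernel, which is not necessarily the authors' exact kernel design, and the GSP feature extraction itself is not shown.

```python
# Minimal sketch of the kernel-SVM classification stage, assuming each image i
# is represented by an (n_paths_i, d) array of deep GSP features (hypothetical
# input; the UDAL feature learning is not reproduced here).
import numpy as np
from sklearn.svm import SVC


def image_kernel(feats_a, feats_b, gamma=0.1):
    """RBF similarity between the mean-pooled GSP features of two images."""
    a = feats_a.mean(axis=0)
    b = feats_b.mean(axis=0)
    return np.exp(-gamma * np.sum((a - b) ** 2))


def gram_matrix(feature_sets_x, feature_sets_y, gamma=0.1):
    """Pairwise image-kernel (Gram) matrix between two image collections."""
    return np.array([[image_kernel(fa, fb, gamma) for fb in feature_sets_y]
                     for fa in feature_sets_x])


# Hypothetical data: four training images, each with a variable number of
# 128-dimensional GSP features, and binary scene labels.
rng = np.random.default_rng(0)
train_feats = [rng.normal(size=(rng.integers(5, 10), 128)) for _ in range(4)]
train_labels = np.array([0, 0, 1, 1])
test_feats = [rng.normal(size=(7, 128))]

K_train = gram_matrix(train_feats, train_feats)   # n_train x n_train
K_test = gram_matrix(test_feats, train_feats)     # n_test  x n_train

clf = SVC(kernel="precomputed")   # kernel SVM on the precomputed image kernel
clf.fit(K_train, train_labels)
print(clf.predict(K_test))        # predicted scene class for the test image
```

In this sketch the scene-level decision depends only on the precomputed image kernel, so richer kernel constructions over GSP features can be swapped in without changing the SVM stage.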