
A Multi-Modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling.

Author Information

Asif Umar, Bennamoun Mohammed, Sohel Ferdous A

Publication Information

IEEE Trans Pattern Anal Mach Intell. 2018 Sep;40(9):2051-2065. doi: 10.1109/TPAMI.2017.2747134. Epub 2017 Aug 30.

Abstract

While deep convolutional neural networks have shown remarkable success in image classification, inter-class similarities, intra-class variances, the effective combination of multi-modal data, and the spatial variability of objects in images remain major challenges. To address these problems, this paper proposes a novel framework that learns a discriminative and spatially invariant classification model for object and indoor scene recognition using multi-modal RGB-D imagery. This is achieved through three postulates: 1) spatial invariance, achieved by combining a spatial transformer network with a deep convolutional neural network to learn features that are invariant to spatial translations, rotations, and scale changes; 2) high discriminative capability, achieved by introducing Fisher encoding within the CNN architecture to learn features with small inter-class similarity and large intra-class compactness; and 3) multi-modal hierarchical fusion, achieved by regularizing a multi-modal CNN architecture with semantic segmentation, where class probabilities are estimated at different hierarchical levels (i.e., image and pixel levels) and fused into a Conditional Random Field (CRF)-based inference hypothesis, the optimization of which produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets and on live video streams (acquired from a Kinect) show that our framework produces superior object and scene classification results compared to state-of-the-art methods.
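The hierarchical fusion idea in the third postulate — combining image-level and pixel-level class probabilities into a single labeling — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names and the log-linear fusion weight `alpha` are illustrative assumptions, and the paper's actual fusion is carried out inside a CRF inference whose optimization also enforces spatial consistency, which this sketch omits.

```python
import numpy as np

def fuse_class_probabilities(pixel_probs, image_probs, alpha=0.5):
    """Log-linearly fuse pixel-level and image-level class probabilities.

    pixel_probs: (H, W, C) per-pixel class distributions.
    image_probs: (C,) image-level class distribution, broadcast to every pixel.
    alpha: weight on the pixel-level term (illustrative, not from the paper).
    Returns a (H, W, C) array of renormalized fused distributions.
    """
    eps = 1e-12  # avoid log(0)
    log_fused = (alpha * np.log(pixel_probs + eps)
                 + (1.0 - alpha) * np.log(image_probs + eps))
    fused = np.exp(log_fused)
    fused /= fused.sum(axis=-1, keepdims=True)  # renormalize over classes
    return fused

def label_map(fused_probs):
    """Per-pixel class labels: argmax over the fused class dimension."""
    return fused_probs.argmax(axis=-1)
```

In a full CRF formulation these fused distributions would serve as unary potentials, with pairwise terms between neighboring pixels encouraging consistent labels; here the argmax stands in for that inference step.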

