IEEE Trans Vis Comput Graph. 2019 Dec;25(12):3244-3257. doi: 10.1109/TVCG.2018.2866793. Epub 2018 Aug 23.
Multi-view deep neural networks are perhaps the most successful approach to 3D shape classification. However, the fusion of multi-view features based on max or average pooling lacks a view selection mechanism, limiting their application in, e.g., multi-view active object recognition by a robot. This paper presents VERAM, a view-enhanced recurrent attention model capable of actively selecting a sequence of views for highly accurate 3D shape classification. VERAM addresses an important issue commonly found in existing attention-based models, i.e., the unbalanced training of the subnetworks responsible for next-view estimation and shape classification. The classification subnetwork easily overfits while the view-estimation subnetwork is usually poorly trained, leading to suboptimal classification performance. This is surmounted by three essential view-enhancement strategies: 1) enhancing the information flow of gradient backpropagation for the view-estimation subnetwork, 2) devising a highly informative reward function for the reinforcement training of view estimation, and 3) formulating a novel loss function that explicitly circumvents view duplication. Taking grayscale images as input and AlexNet as the CNN architecture, VERAM with 9 views achieves instance-level and class-level accuracies of 95.5 and 95.3 percent on ModelNet10, and 93.7 and 92.1 percent on ModelNet40, both state-of-the-art under the same number of views.
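To make the described architecture concrete, below is a minimal sketch of the recurrent attention loop in PyTorch, assuming batch size 1. The module structure, the stand-in CNN, and the masking trick used to suppress visited views are illustrative assumptions, not VERAM's released implementation; the reward-driven reinforcement training and the gradient-flow enhancement (strategies 1 and 2) are omitted.

```python
import torch
import torch.nn as nn

class RecurrentViewAttention(nn.Module):
    def __init__(self, num_views=20, num_classes=40, feat_dim=256, hid_dim=256):
        super().__init__()
        # Stand-in for the AlexNet feature extractor on grayscale renderings.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim), nn.ReLU())
        self.rnn = nn.GRUCell(feat_dim, hid_dim)         # aggregates observed views
        self.view_head = nn.Linear(hid_dim, num_views)   # next-view estimation
        self.cls_head = nn.Linear(hid_dim, num_classes)  # shape classification

    def forward(self, render, num_glimpses=9):
        # render(v) -> (1, 1, H, W) grayscale image of the shape from view v.
        h, view, visited = None, 0, []
        for _ in range(num_glimpses):
            feat = self.cnn(render(view))
            h = self.rnn(feat, h)
            logits = self.view_head(h)
            # Strategy 3 from the abstract, approximated here by masking:
            # suppress already-visited views so the argmax cannot repeat them.
            mask = torch.zeros_like(logits)
            for v in visited:
                mask[:, v] = -1e3
            view = int((logits + mask).argmax(dim=1))
            visited.append(view)
        return self.cls_head(h)                          # class scores after last view

# Toy usage: a dummy renderer that ignores the view index.
model = RecurrentViewAttention()
render = lambda v: torch.rand(1, 1, 64, 64)
scores = model(render)  # (1, num_classes)
```

Note that the paper formulates view deduplication as an explicit loss term during training, whereas this sketch approximates it with a hard mask at selection time.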