Chi Li, M Zeeshan Zia, Quoc-Huy Tran, Xiang Yu, Gregory D Hager, Manmohan Chandraker
IEEE Trans Pattern Anal Mach Intell. 2019 Aug;41(8):1828-1843. doi: 10.1109/TPAMI.2018.2863285. Epub 2018 Aug 13.
Recent data-driven approaches to scene interpretation predominantly pose inference as an end-to-end black-box mapping, commonly performed by a Convolutional Neural Network (CNN). However, decades of work on perceptual organization in both human and machine vision suggest that there are often intermediate representations that are intrinsic to an inference task, and which provide essential structure to improve generalization. In this work, we explore an approach for injecting prior domain structure into neural network training by supervising hidden layers of a CNN with intermediate concepts that normally are not observed in practice. We formulate a probabilistic framework that formalizes these notions and predicts improved generalization via this deep supervision method. One advantage of this approach is that we are able to train only on synthetic CAD renderings of cluttered scenes, where concept values can be extracted, yet apply the results to real images. Our implementation achieves state-of-the-art performance in 2D/3D keypoint localization and image classification on real-image benchmarks including KITTI, PASCAL VOC, PASCAL3D+, IKEA, and CIFAR100. We provide additional evidence that our approach outperforms alternative forms of supervision, such as multi-task networks.
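The core idea of deep supervision can be illustrated with a minimal sketch: an auxiliary head reads the hidden layer and is trained against an intermediate-concept target, so the total loss combines the main-task loss with a weighted concept loss. The network shapes, names (`Wc`, `deeply_supervised_loss`), and the weighting `lam` below are illustrative assumptions, not the paper's actual architecture or values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network (shapes are hypothetical, for illustration only):
W1 = rng.normal(size=(4, 8)) * 0.1   # input -> hidden
W2 = rng.normal(size=(8, 3)) * 0.1   # hidden -> main-task output
Wc = rng.normal(size=(8, 2)) * 0.1   # hidden -> intermediate-concept head

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def deeply_supervised_loss(x, y_main, y_concept, lam=0.5):
    """Total loss = main-task loss + lam * intermediate-concept loss.

    The concept loss supervises the hidden representation directly,
    which is the mechanism the abstract refers to as deep supervision.
    """
    h = np.tanh(x @ W1)                      # hidden representation
    main_loss = mse(h @ W2, y_main)          # supervision on the final output
    concept_loss = mse(h @ Wc, y_concept)    # supervision on the hidden layer
    return main_loss + lam * concept_loss

# During training on synthetic renderings, y_concept would hold the
# extracted concept values; here we use random stand-in data.
x = rng.normal(size=(5, 4))
y_main = rng.normal(size=(5, 3))
y_concept = rng.normal(size=(5, 2))
loss = deeply_supervised_loss(x, y_main, y_concept)
print(loss)
```

At test time the concept head is simply dropped; it only shapes the hidden representation during training.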