IEEE Trans Image Process. 2019 May;28(5):2116-2125. doi: 10.1109/TIP.2018.2881920. Epub 2018 Nov 16.
Deep convolutional neural networks (CNNs) have revolutionized computer vision research and have seen unprecedented adoption for tasks such as classification, detection, and caption generation. However, they offer little transparency into their inner workings and are often treated as black boxes that simply deliver excellent performance. In this paper, we aim to alleviate this opacity by providing visual explanations for the network's predictions. Our approach can analyze a variety of CNN-based models trained for computer vision applications, such as object recognition and caption generation. Unlike existing methods, we achieve this by unraveling the forward-pass operation: the proposed method exploits feature dependencies across the layer hierarchy to uncover the discriminative image locations that guide the network's predictions. We name these locations CNN fixations, loosely analogous to human eye fixations. Our approach is generic, requiring no architectural changes, additional training, or gradient computation to compute these important image locations. We demonstrate through a variety of applications that our approach localizes the discriminative image locations across different network architectures, diverse vision tasks, and data modalities.
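To make the layer-wise backtracking concrete, below is a minimal sketch of the idea for fully connected layers, written in PyTorch. It is not the authors' implementation: the function name backtrack_fc, the top-k cutoff, and the positive-contribution filter are illustrative assumptions, and the paper additionally handles convolution and pooling layers through their receptive fields.

```python
import torch
import torch.nn as nn

# Minimal sketch (not the authors' code): backtrack "fixations" through
# fully connected layers by keeping, at each layer, the input units whose
# contribution x_i * w_ij to a currently selected output unit is a strong
# positive piece of evidence.

def backtrack_fc(layer: nn.Linear, x: torch.Tensor, selected: list,
                 keep: int = 5) -> list:
    """For each selected output unit j, keep up to `keep` inputs i with the
    largest positive contributions x[i] * W[j, i], and union them."""
    fixations = set()
    for j in selected:
        contrib = x * layer.weight[j]          # elementwise x_i * w_ij
        top = torch.topk(contrib, keep).indices
        for i in top.tolist():
            if contrib[i] > 0:                 # keep only positive evidence
                fixations.add(i)
    return sorted(fixations)

# Toy network: two FC layers on a flattened input.
torch.manual_seed(0)
fc1, fc2 = nn.Linear(64, 32), nn.Linear(32, 10)
x = torch.rand(64)
h = torch.relu(fc1(x))
logits = fc2(h)

pred = int(logits.argmax())                    # start from the predicted class
hidden_fix = backtrack_fc(fc2, h, [pred])      # fixations in the hidden layer
input_fix = backtrack_fc(fc1, x, hidden_fix)   # fixations at the input
print(pred, hidden_fix, input_fix)
```

Starting from the predicted class unit and walking the evidence backwards is what lets this style of analysis avoid gradient computation entirely; it reuses only the forward activations and the learned weights.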