IEEE Trans Pattern Anal Mach Intell. 2022 Oct;44(10):5980-5995. doi: 10.1109/TPAMI.2021.3083769. Epub 2022 Sep 14.
Deep visual models are susceptible to adversarial perturbations of their inputs. Although these signals are carefully crafted, they still appear as noise-like patterns to humans. This observation has led to the argument that deep visual representations are misaligned with human perception. We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations. We first propose an attack that fools a network into confusing a whole category of objects (the source class) with a target label. Our attack also limits unintended fooling on samples from non-source classes, thereby circumscribing human-defined semantic notions for network fooling. We show that the proposed attack not only leads to the emergence of regular geometric patterns in the perturbations, but also reveals insightful information about the decision boundaries of deep models. Exploring this phenomenon further, we alter the 'adversarial' objective of our attack to use it as a tool to 'explain' deep visual representations. We show that by careful channeling and projection of the perturbations computed by our method, we can visualize a model's understanding of human-defined semantic notions. Finally, we exploit the explainability properties of our perturbations to perform image generation, inpainting, and interactive image manipulation by attacking adversarially robust 'classifiers'. In all, our major contribution is a novel, pragmatic adversarial attack that is subsequently transformed into a tool for interpreting visual models. The article also makes secondary contributions by establishing the utility of our attack beyond the adversarial objective through multiple interesting applications.
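The attack objective summarized in the abstract (one perturbation that pushes a whole source class toward a target label while leaving non-source classes alone) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy linear softmax classifier in place of a deep network, a signed-gradient update with L-infinity clipping as the optimizer, and hypothetical names (`objective`, `class_targeted_perturbation`) and hyperparameters (`eps`, `steps`, `lr`) chosen only for illustration.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def objective(W, b, X, labels, p):
    """Mean cross-entropy of a toy linear model on X + p, and its gradient w.r.t. p."""
    probs = softmax((X + p) @ W.T + b)
    n = X.shape[0]
    loss = -np.log(probs[np.arange(n), labels] + 1e-12).mean()
    onehot = np.zeros_like(probs)
    onehot[np.arange(n), labels] = 1.0
    grad = ((probs - onehot) @ W).mean(axis=0)  # chain rule through the linear layer
    return loss, grad


def class_targeted_perturbation(W, b, X, y, source, target,
                                eps=0.5, steps=200, lr=0.05):
    """Search for a single perturbation p with max |p_i| <= eps (assumed budget).

    Desired labels: `target` for every source-class sample, and the original
    label for every other sample, so fooling of non-source classes is penalised.
    """
    desired = np.where(y == source, target, y)
    p = np.zeros(X.shape[1])
    best_p = p.copy()
    best_loss = np.inf
    for _ in range(steps):
        loss, grad = objective(W, b, X, desired, p)
        if loss < best_loss:  # remember the best perturbation seen so far
            best_loss, best_p = loss, p.copy()
        # signed-gradient step, then projection back into the L-infinity ball
        p = np.clip(p - lr * np.sign(grad), -eps, eps)
    loss, _ = objective(W, b, X, desired, p)
    if loss < best_loss:
        best_p = p.copy()
    return best_p


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(3, 5)), np.zeros(3)      # random toy classifier
    X, y = rng.normal(size=(60, 5)), rng.integers(0, 3, size=60)
    p = class_targeted_perturbation(W, b, X, y, source=0, target=1)
    preds = softmax((X + p) @ W.T + b).argmax(axis=1)
    print("source samples sent to target:", (preds[y == 0] == 1).mean())
```

Because the loss mixes a targeted term for the source class with a label-preserving term for all other samples, the same perturbation is encouraged to be class-specific rather than universally disruptive. In a deep model the gradient would come from backpropagation rather than the closed form used here.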