Lyu Bochen, Zhu Zhanxing
IEEE Trans Pattern Anal Mach Intell. 2025 Sep;47(9):8025-8039. doi: 10.1109/TPAMI.2025.3575618.
Adversarial training has been empirically demonstrated to be an effective strategy for improving the robustness of deep neural networks (DNNs) against adversarial examples. However, the underlying reason for its effectiveness remains unclear. In this paper, we conduct an extensive theoretical and empirical analysis of the implicit bias induced by adversarial training from a generalized margin perspective. Our results focus on adversarial training for homogeneous DNNs. In particular, (i) for deep linear networks with $\ell_p$-norm perturbations, we show that the weight matrices of adjacent layers become aligned and that the converged parameters maximize the margin of the adversarial examples; this margin can further be viewed as a generalized margin of the original dataset, attained by an interpolation between the $\ell_2$-SVM and the $\ell_q$-SVM solutions, where $1/p + 1/q = 1$. (ii) For general homogeneous DNNs, both linear and nonlinear, we study adversarial training under a variety of adversarial perturbations in a unified manner. Specifically, we show that the direction of the limit point of the parameters converges to a KKT point of a constrained optimization problem that maximizes the margin of the adversarial examples. Additionally, applying this general result to two special linear homogeneous DNNs, diagonal linear networks and linear convolutional networks, we show that adversarial training with $\ell_p$-norm perturbations equivalently minimizes an interpolation norm in the predictor space that depends on the depth, the architecture, and the value of $p$. Extensive experiments are conducted to verify our theoretical claims. Our results provide a theoretical basis for the longstanding folklore (Madry et al., 2018) that adversarial training improves robustness by using adversarial examples to modify the decision boundary, and they potentially offer insights for designing new robust training strategies.
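For concreteness, the constrained problem in result (ii) can be sketched as below. The display is a minimal sketch in our own notation (parameter vector $\theta$, network output $f(\theta; x)$, training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, perturbation budget $\epsilon$), not the paper's exact statement:

% Sketch of the adversarial max-margin problem (our notation, assumed setup):
% \theta collects all parameters, f(\theta; x) is the homogeneous network
% output, and each constraint asks every l_p perturbation of x_i within
% budget \epsilon to be classified with margin at least 1.
\begin{equation*}
  \min_{\theta}\; \tfrac{1}{2}\,\|\theta\|_2^2
  \quad \text{s.t.} \quad
  \min_{\|\delta_i\|_p \le \epsilon}\; y_i\, f(\theta;\, x_i + \delta_i) \;\ge\; 1,
  \qquad i = 1, \dots, n.
\end{equation*}

For a linear predictor $f(\theta; x) = \theta^{\top} x$, the inner minimum has the closed form $y_i\,\theta^{\top} x_i - \epsilon\,\|\theta\|_q$ with $1/p + 1/q = 1$ (by Hölder duality), so the constraints mix the $\ell_2$ objective with an $\ell_q$ penalty; this is one way to read the interpolation between the $\ell_2$-SVM and the $\ell_q$-SVM in result (i).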
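The $\ell_p$ adversarial training analyzed above can be pictured with a small self-contained simulation. The sketch below is entirely our own construction (synthetic separable data, a depth-2 linear network, exponential loss, the $\ell_\infty$ threat model as the $p = \infty$ case) and is not the paper's experimental code. For a linear predictor $\theta = W_1^{\top} w_2$, the worst-case $\ell_\infty$ attack on $(x, y)$ has the closed form $\delta = -\epsilon\, y\, \mathrm{sign}(\theta)$, so the inner maximization can be carried out exactly:

import numpy as np

# Minimal sketch (our own illustration): adversarial training of a depth-2
# linear network f(x) = w2 @ (W1 @ x) under an l_inf budget eps, i.e. the
# p = inf case of the l_p threat model analyzed in the paper.
rng = np.random.default_rng(0)
n, d, h = 200, 20, 32
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)
y = np.sign(X @ theta_star)          # separable labels from a ground-truth direction
eps, lr, epochs = 0.05, 0.1, 2000

W1 = rng.normal(size=(h, d)) * 0.1   # layer 1 weights
w2 = rng.normal(size=h) * 0.1        # layer 2 weights

for _ in range(epochs):
    theta = W1.T @ w2                                    # end-to-end linear predictor
    delta = -eps * y[:, None] * np.sign(theta)[None, :]  # exact worst-case l_inf attack
    Xa = X + delta                                       # adversarial examples
    margins = y * (Xa @ theta)
    loss_grad = -y * np.exp(-margins) / n                # d(exp loss)/d(margin)
    g_theta = Xa.T @ loss_grad                           # gradient w.r.t. the predictor
    W1 -= lr * np.outer(w2, g_theta)                     # chain rule through layer 1
    w2 -= lr * (W1 @ g_theta)                            # chain rule through layer 2

# Per result (i), theta/||theta|| should approach the max-margin direction
# of the adversarially perturbed dataset as training proceeds.
theta = W1.T @ w2
robust_margins = y * ((X - eps * y[:, None] * np.sign(theta)[None, :]) @ theta)
print("min robust margin:", robust_margins.min())

Treating the perturbation as fixed when differentiating, as done here, matches standard adversarial-training practice: for the inner maximization solved exactly, this gradient is the gradient of the robust loss by Danskin's theorem.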