IEEE Trans Neural Netw Learn Syst. 2022 Oct;33(10):5766-5774. doi: 10.1109/TNNLS.2021.3071370. Epub 2022 Oct 5.
In classification, the use of the 0-1 loss is preferable because the minimizer of the 0-1 risk is the Bayes optimal classifier. However, due to the nonconvexity of the 0-1 loss, this optimization problem is NP-hard, so many convex surrogate loss functions have been adopted instead. Previous works have shown that if a Bayes-risk-consistent loss function is used as a surrogate, the minimizer of the empirical surrogate risk attains the Bayes optimal classifier as the sample size tends to infinity. Nevertheless, how fast the minimizers of different empirical surrogate risks converge to the Bayes optimal classifier has rarely been compared. Which characterization of a surrogate loss determines its convergence rate to the Bayes optimal classifier? Can we modify the loss function to achieve a faster convergence rate? In this article, we study the convergence rates of empirical surrogate minimizers to the Bayes optimal classifier. Specifically, we introduce the notions of consistency intensity and conductivity to characterize a surrogate loss function and exploit these notions to obtain the rate of convergence of an empirical surrogate risk minimizer to the Bayes optimal classifier, enabling fair comparisons of the excess risks of different surrogate risk minimizers. The main result of this article has practical implications, including: 1) showing that the hinge loss (SVM) is superior to the logistic loss (logistic regression) and the exponential loss (AdaBoost), in the sense that its empirical minimizer converges faster to the Bayes optimal classifier, and 2) guiding the design of new loss functions that speed up convergence to the Bayes optimal classifier via a data-dependent loss correction method inspired by our theorems.
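For context on why the choice of surrogate can change the rate, the classical calibration analysis of Bartlett, Jordan, and McAuliffe (2006) already relates the excess 0-1 risk to the excess surrogate risk through a loss-specific transform ψ; the sketch below uses that standard ψ-transform purely for illustration and is not the consistency-intensity/conductivity analysis developed in this article. For a margin loss φ(yf(x)),

\[
R(f) - R^{*} \;\le\; \psi^{-1}\!\big(R_{\phi}(f) - R_{\phi}^{*}\big),
\]
\[
\text{hinge: } \phi(t)=\max(0,1-t),\quad \psi(\theta)=|\theta|;
\qquad
\text{exponential: } \phi(t)=e^{-t},\quad \psi(\theta)=1-\sqrt{1-\theta^{2}};
\]
\[
\text{logistic: } \phi(t)=\log\!\big(1+e^{-t}\big),\quad \psi(\theta)\approx \theta^{2}/2 \text{ near } \theta=0.
\]

Under this transform, an excess surrogate risk of order ε translates into an excess 0-1 risk of order ε for the hinge loss but only of order √ε for the exponential and logistic losses, which is consistent with the comparison stated above.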