Jason W. Rocks, Pankaj Mehta
Department of Physics, Boston University, Boston, Massachusetts 02215, USA.
Faculty of Computing and Data Sciences, Boston University, Boston, Massachusetts 02215, USA.
Phys Rev Res. 2022 Mar-May;4(1). doi: 10.1103/physrevresearch.4.013201. Epub 2022 Mar 15.
The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., the number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities that strike a balance between bias and variance. Modern deep learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" in which the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold, the test error of the two-layer neural network decreases due to a monotonic decrease in both the bias and the variance, in contrast with the classical bias-variance trade-off. We also show that, contrary to classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.
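The behavior summarized above can be reproduced numerically with a minimal sketch (not the paper's analytic calculation): the test error is decomposed as E_test ≈ Bias^2 + Variance by averaging predictions over independent draws of the training set, for a random-feature least-squares model whose number of features is varied through the interpolation threshold. All specific choices below (n_train = 40, d = 30, tanh random features, a noiseless linear teacher) are illustrative assumptions and are not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Noiseless linear teacher y = x . beta, illustrating that over-parameterized
# models can overfit even in the absence of noise (illustrative setup).
n_train, n_test, d = 40, 500, 30
beta = rng.standard_normal(d) / np.sqrt(d)

X_test = rng.standard_normal((n_test, d))
y_test = X_test @ beta

def bias_variance(n_features, n_trials=100):
    """Return (bias^2, variance) on the test set, averaged over
    independent draws of the training data."""
    # One fixed random feature map shared across trials (the 'student' architecture).
    W = rng.standard_normal((d, n_features)) / np.sqrt(d)
    Phi_test = np.tanh(X_test @ W)
    preds = np.empty((n_trials, n_test))
    for t in range(n_trials):
        X = rng.standard_normal((n_train, d))
        y = X @ beta
        Phi = np.tanh(X @ W)
        # Minimum-norm least-squares fit (pseudo-inverse), defined on both
        # sides of the interpolation threshold.
        w_hat = np.linalg.pinv(Phi) @ y
        preds[t] = Phi_test @ w_hat
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - y_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for p in (10, 20, 40, 80, 160):
    b2, var = bias_variance(p)
    print(f"features={p:4d}  bias^2={b2:.3f}  variance={var:.3f}  test error={b2 + var:.3f}")

Running this prints one line per feature count; the variance column should spike when the number of features is close to the number of training samples (here around 40) and fall off on either side, while the squared bias stays finite and decreases slowly, mirroring the divergence at the interpolation threshold and the subsequent monotonic decrease described in the abstract.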