Baldi Pierre, Sadowski Peter
Department of Computer Science, University of California, Irvine, Irvine, CA 92697-3435.
Artif Intell. 2014 May;210:78-122. doi: 10.1016/j.artint.2014.02.004.
Dropout is a recently introduced algorithm for training neural networks by randomly dropping units during training to prevent their co-adaptation. A mathematical analysis of some of the static and dynamic properties of dropout is provided using Bernoulli gating variables, general enough to accommodate dropout on units or connections, and with variable rates. The framework allows a complete analysis of the ensemble averaging properties of dropout in linear networks, which is useful for understanding the non-linear case. The ensemble averaging properties of dropout in non-linear logistic networks result from three fundamental equations: (1) the approximation of the expectations of logistic functions by normalized geometric means, for which bounds and estimates are derived; (2) the algebraic equality between the normalized geometric mean of logistic functions and the logistic of the mean, which mathematically characterizes logistic functions; and (3) the linearity of the means with respect to sums, as well as products of independent variables. The results are also extended to other classes of transfer functions, including rectified linear functions. Approximation errors tend to cancel each other and do not accumulate. Dropout can also be connected to stochastic neurons and used to predict firing rates, and to backpropagation by viewing the backward propagation as ensemble averaging in a dropout linear network. Moreover, the convergence properties of dropout can be understood in terms of stochastic gradient descent. Finally, regarding the regularization properties of dropout, the expectation of the dropout gradient is the gradient of the corresponding approximation ensemble, regularized by an adaptive weight decay term with a propensity for self-consistent variance minimization and sparse representations.
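To make the first two fundamental equations concrete, the following is a minimal numerical sketch, not code from the paper; the network size n, keep-probability p, weights, and variable names are illustrative assumptions. It enumerates all Bernoulli dropout masks of a single logistic unit and checks that the normalized geometric mean (NGM) of the sub-network outputs equals the logistic of the expected input sum, which in turn approximates the true ensemble expectation.

```python
# Hypothetical sketch: for a single logistic unit under Bernoulli dropout,
# compare the exact ensemble expectation E[sigma(S)], the normalized geometric
# mean of the sub-network outputs, and the logistic of the mean input sum.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 0.5                      # number of inputs, keep-probability (assumed values)
w = rng.normal(size=n)             # fixed weights
x = rng.normal(size=n)             # fixed input activities

outputs, probs = [], []
for mask in itertools.product([0, 1], repeat=n):   # all 2^n dropout sub-networks
    m = np.array(mask)
    prob = np.prod(np.where(m == 1, p, 1 - p))     # Bernoulli probability of this mask
    s = np.dot(w, m * x)                           # input sum of the sub-network
    outputs.append(1.0 / (1.0 + np.exp(-s)))       # logistic output O = sigma(S)
    probs.append(prob)

O = np.array(outputs)
P = np.array(probs)

E_O = np.sum(P * O)                                # true ensemble expectation E[O]
G = np.prod(O ** P)                                # geometric mean of the outputs
G_c = np.prod((1.0 - O) ** P)                      # geometric mean of the complements
NGM = G / (G + G_c)                                # normalized geometric mean
E_S = p * np.dot(w, x)                             # expected input sum
sigma_E_S = 1.0 / (1.0 + np.exp(-E_S))             # logistic of the mean (weight scaling)

print(f"E[O]        = {E_O:.6f}")
print(f"NGM(O)      = {NGM:.6f}")                  # equals sigma(E[S]) exactly
print(f"sigma(E[S]) = {sigma_E_S:.6f}")            # approximates E[O]
```

In this single-unit setting the equality NGM(O) = sigma(E[S]) holds exactly, which is why deterministic inference with weights scaled by the keep-probability recovers the normalized geometric mean of the dropout ensemble; only the step E[O] ≈ NGM(O) is an approximation, for which the paper derives bounds and estimates.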