
Hebbian Descent: A Unified View on Log-Likelihood Learning.

Authors

Melchior Jan, Schiewer Robin, Wiskott Laurenz

Affiliations

Ruhr University Bochum, 44801 Bochum, Germany

Publication

Neural Comput. 2024 Aug 19;36(9):1669-1712. doi: 10.1162/neco_a_01684.

DOI: 10.1162/neco_a_01684
PMID: 39163553
Abstract

This study discusses the negative impact of the derivative of the activation functions in the output layer of artificial neural networks, in particular in continual learning. We propose Hebbian descent as a theoretical framework to overcome this limitation, which is implemented through an alternative loss function for gradient descent we refer to as Hebbian descent loss. This loss is effectively the generalized log-likelihood loss and corresponds to an alternative weight update rule for the output layer wherein the derivative of the activation function is disregarded. We show how this update avoids vanishing error signals during backpropagation in saturated regions of the activation functions, which is particularly helpful in training shallow neural networks and deep neural networks where saturating activation functions are only used in the output layer. In combination with centering, Hebbian descent leads to better continual learning capabilities. It provides a unifying perspective on Hebbian learning, gradient descent, and generalized linear models, for all of which we discuss the advantages and disadvantages. Given activation functions with strictly positive derivative (as is often the case in practice), Hebbian descent inherits the convergence properties of regular gradient descent. While established pairings of loss and output-layer activation function (e.g., mean squared error with linear or cross-entropy with sigmoid/softmax) are subsumed by Hebbian descent, we provide general insights for designing arbitrary loss-activation-function combinations that benefit from Hebbian descent. For shallow networks, we show that Hebbian descent outperforms Hebbian learning, performs similarly to regular gradient descent, and performs much better than all other tested update rules in continual learning. In combination with centering, Hebbian descent implements a forgetting mechanism that prevents catastrophic interference notably better than the other tested update rules. When training deep neural networks, our experimental results suggest that Hebbian descent performs better than or similarly to gradient descent.
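The core contrast the abstract describes — dropping the activation-function derivative from the output-layer update — can be sketched in a few lines. This is an illustrative NumPy sketch, not the authors' implementation: it assumes a single-layer network with a sigmoid output trained on squared error, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient_descent_update(W, x, y, lr=0.1):
    # Squared-error gradient for a sigmoid output unit: the error
    # signal is scaled by the activation derivative h'(a) = h(1 - h).
    h = sigmoid(W @ x)
    delta = (h - y) * h * (1.0 - h)   # includes h'(a)
    return W - lr * np.outer(delta, x)

def hebbian_descent_update(W, x, y, lr=0.1):
    # Hebbian descent disregards h'(a): the update is (h - y) x^T,
    # i.e. the gradient of the generalized log-likelihood loss
    # (for sigmoid, identical to the cross-entropy gradient).
    h = sigmoid(W @ x)
    delta = h - y                     # no activation derivative
    return W - lr * np.outer(delta, x)
```

In a saturated region (large |W @ x|), h'(a) is close to zero, so the plain squared-error update barely moves the weights, while the Hebbian descent update still pushes them in the error-reducing direction — the vanishing-error-signal effect the abstract refers to.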


Similar Articles

1. Power Function Error Initialization Can Improve Convergence of Backpropagation Learning in Neural Networks for Classification.
Neural Comput. 2021 Jul 26;33(8):2193-2225. doi: 10.1162/neco_a_01407.
2. Merging Back-propagation and Hebbian Learning Rules for Robust Classifications.
Neural Netw. 1996 Oct;9(7):1213-1222. doi: 10.1016/0893-6080(96)00042-1.
3. Learning smooth dendrite morphological neurons by stochastic gradient descent for pattern classification.
Neural Netw. 2023 Nov;168:665-676. doi: 10.1016/j.neunet.2023.09.033. Epub 2023 Sep 25.
4. Deep convolutional neural network and IoT technology for healthcare.
Digit Health. 2024 Jan 17;10:20552076231220123. doi: 10.1177/20552076231220123. eCollection 2024 Jan-Dec.
5. Hebbian semi-supervised learning in a sample efficiency setting.
Neural Netw. 2021 Nov;143:719-731. doi: 10.1016/j.neunet.2021.08.003. Epub 2021 Aug 13.
6. Accelerating the training of feedforward neural networks using generalized Hebbian rules for initializing the internal representations.
IEEE Trans Neural Netw. 1996;7(2):419-26. doi: 10.1109/72.485677.
7. A theory of local learning, the learning channel, and the optimality of backpropagation.
Neural Netw. 2016 Nov;83:51-74. doi: 10.1016/j.neunet.2016.07.006. Epub 2016 Aug 5.
8. Learning cortical hierarchies with temporal Hebbian updates.
Front Comput Neurosci. 2023 May 24;17:1136010. doi: 10.3389/fncom.2023.1136010. eCollection 2023.
9. Modelling continual learning in humans with Hebbian context gating and exponentially decaying task signals.
PLoS Comput Biol. 2023 Jan 19;19(1):e1010808. doi: 10.1371/journal.pcbi.1010808. eCollection 2023 Jan.