Suri R E, Schultz W
Institute of Physiology, University of Fribourg, Switzerland.
Exp Brain Res. 1998 Aug;121(3):350-4. doi: 10.1007/s002210050467.
Dopamine neurons appear to code an error in the prediction of reward. They are activated by unpredicted rewards, are not influenced by predicted rewards, and are depressed when a predicted reward is omitted. After conditioning, they respond to reward-predicting stimuli in a similar manner. With these characteristics, the dopamine response strongly resembles the predictive reinforcement teaching signal of neural network models implementing the temporal difference learning algorithm. This study explored a neural network model that used a reward-prediction error signal strongly resembling dopamine responses for learning movement sequences. A different stimulus was presented in each step of the sequence and required a different movement reaction, and reward occurred at the end of the correctly performed sequence. The dopamine-like predictive reinforcement signal efficiently allowed the model to learn long sequences. By contrast, learning with an unconditional reinforcement signal required synaptic eligibility traces of longer and biologically less-plausible durations for obtaining satisfactory performance. Thus, dopamine-like neuronal signals constitute excellent teaching signals for learning sequential behavior.
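The mechanism the abstract describes can be sketched as a standard temporal-difference update with eligibility traces. This is a minimal illustration, not the authors' actual network: a tabular value function over a fixed sequence of steps, with reward delivered only at the end. The TD error `delta` plays the role of the dopamine-like reward-prediction-error teaching signal, and `lam` controls the duration of the synaptic eligibility traces (all parameter values are illustrative assumptions).

```python
import numpy as np

def learn_sequence(n_steps=10, gamma=0.98, alpha=0.1, lam=0.9, episodes=500):
    """TD(lambda) value learning on a chain of sequential steps.

    Sketch only: reward occurs solely after the final step, as in the
    task described in the abstract; all hyperparameters are assumed.
    """
    V = np.zeros(n_steps + 1)                  # value per step; last entry is terminal
    for _ in range(episodes):
        e = np.zeros_like(V)                   # synaptic eligibility traces
        for s in range(n_steps):
            r = 1.0 if s == n_steps - 1 else 0.0   # reward only at sequence end
            delta = r + gamma * V[s + 1] - V[s]    # dopamine-like prediction error
            e *= gamma * lam                       # decay all traces
            e[s] += 1.0                            # mark current step as eligible
            V += alpha * delta * e                 # credit all recently eligible steps
    return V

V = learn_sequence()
```

Because the prediction error propagates the reward signal backward one step per episode, even short eligibility traces suffice to learn long sequences; with an unconditional (reward-only) teaching signal, the traces themselves would have to span the whole sequence, which is the contrast the abstract draws.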