Elliot A. Ludvig, Richard S. Sutton, E. James Kehoe
University of Alberta, Edmonton, Alberta, Canada.
Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Comput. 2008 Dec;20(12):3034-54. doi: 10.1162/neco.2008.11-07-654.
The phasic firing of dopamine neurons has been theorized to encode a reward-prediction error as formalized by the temporal-difference (TD) algorithm in reinforcement learning. Most TD models of dopamine have assumed a stimulus representation, known as the complete serial compound, in which each moment in a trial is distinctly represented. We introduce a more realistic temporal stimulus representation for the TD model. In our model, all external stimuli, including rewards, spawn a series of internal microstimuli, which grow weaker and more diffuse over time. These microstimuli are used by the TD learning algorithm to generate predictions of future reward. This new stimulus representation injects temporal generalization into the TD model and improves the correspondence between model and data in several experiments, including those in which rewards are omitted or delivered early. This improved fit mostly derives from the absence of large negative errors in the new model, suggesting that dopamine alone can encode the full range of TD errors in these situations.
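The mechanism the abstract describes can be sketched in code. The following is an illustrative reading, not the authors' implementation: each stimulus onset starts an exponentially decaying memory trace, Gaussian basis functions of the trace height serve as the microstimuli, and a linear TD(λ) learner produces the prediction error δ that models the phasic dopamine signal. The parameter values, the Gaussian basis-function form, and all function names here are assumptions made for the sketch.

```python
import numpy as np

def microstimuli(trace, centers, sigma=0.08):
    """Gaussian basis functions of memory-trace height, scaled by the
    trace, so microstimuli grow weaker and more diffuse as time passes."""
    return trace * np.exp(-0.5 * ((trace - centers) / sigma) ** 2)

def run_trial(cs_time, reward_time, w=None, T=200, m=20,
              gamma=0.97, lam=0.95, alpha=0.01, decay=0.985):
    """One trial of linear TD(lambda) over the microstimulus features.
    Both the CS and the reward spawn their own set of m microstimuli.
    Returns per-step TD errors (the modeled phasic dopamine signal)."""
    centers = np.linspace(1.0 / m, 1.0, m)  # basis-function centers
    n = 2 * m                               # CS features + reward features
    w = np.zeros(n) if w is None else w
    e = np.zeros(n)                         # eligibility traces
    traces = np.zeros(2)                    # memory-trace heights [CS, reward]
    deltas = np.zeros(T)
    x_prev, v_prev = np.zeros(n), 0.0
    for t in range(T):
        if t == cs_time:
            traces[0] = 1.0                 # CS onset starts its trace
        r = 0.0
        if reward_time is not None and t == reward_time:
            traces[1] = 1.0                 # reward also spawns microstimuli
            r = 1.0
        x = np.concatenate([microstimuli(traces[0], centers),
                            microstimuli(traces[1], centers)])
        v = w @ x
        delta = r + gamma * v - v_prev      # TD error
        e = gamma * lam * e + x_prev        # accumulating eligibility
        w += alpha * delta * e
        deltas[t] = delta
        traces *= decay                     # memory traces fade each step
        x_prev, v_prev = x, v
    return deltas, w

# Train with a CS at t=20 and reward at t=100, then probe reward omission:
w = None
for _ in range(500):
    _, w = run_trial(cs_time=20, reward_time=100, w=w)
omission_deltas, _ = run_trial(cs_time=20, reward_time=None, w=w)
```

Because neighboring microstimuli overlap, any negative error at the time of an omitted reward is spread over several time steps and stays small, which is one way to read the temporal generalization the abstract refers to.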