Department of Neuroscience, University of Minnesota, Minneapolis, Minnesota, United States of America.
PLoS One. 2009 Oct 20;4(10):e7362. doi: 10.1371/journal.pone.0007362.
Temporal-difference (TD) algorithms have been proposed as models of reinforcement learning (RL). We examine two issues of distributed representation in these TD algorithms: distributed representations of belief and distributed discounting factors. Distributed representation of belief allows the believed state of the world to distribute across sets of equivalent states. Distributed exponential discounting factors produce hyperbolic discounting in the behavior of the agent itself. We examine these issues in the context of a TD RL model in which state-belief is distributed over a set of exponentially-discounting "microAgents", each of which has a separate discounting factor (gamma). Each microAgent maintains an independent hypothesis about the state of the world, and a separate value estimate of taking actions within that hypothesized state. The overall agent thus instantiates a flexible representation of an evolving world-state. As with other TD models, the value-error (delta) signal within the model matches dopamine signals recorded from animals in standard conditioning reward paradigms. The distributed representation of belief provides an explanation for the decrease in dopamine at the conditioned stimulus seen in overtrained animals, for the differences between trace and delay conditioning, and for transient bursts of dopamine seen at movement initiation. Because each microAgent also includes its own exponential discounting factor, the overall agent shows hyperbolic discounting, consistent with behavioral experiments.
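For readers who want a concrete picture of the architecture described in the abstract, the following Python sketch is one possible, deliberately simplified rendering of it: a population of TD(0) microAgents, each with its own exponential discounting factor gamma and its own value table, whose value-errors are pooled to give the overall agent's delta. The class names, the uniform spread of gammas, and the simple averaging rule are illustrative assumptions, not the authors' implementation. The final lines check numerically that mixing exponential discount curves over a range of gammas yields an approximately hyperbolic discount curve of the form 1/(1+t).

    import numpy as np

    class MicroAgent:
        """One TD(0) learner with its own private exponential discounting factor."""
        def __init__(self, gamma, n_states, alpha=0.1):
            self.gamma = gamma                # this microAgent's discounting factor
            self.alpha = alpha                # learning rate
            self.value = np.zeros(n_states)   # value estimate over hypothesized states

        def td_update(self, state, next_state, reward):
            # Standard TD(0) value-error (delta), computed with this agent's own gamma.
            delta = reward + self.gamma * self.value[next_state] - self.value[state]
            self.value[state] += self.alpha * delta
            return delta

    class DistributedAgent:
        """Overall agent: the value-error is pooled across a set of microAgents."""
        def __init__(self, n_micro, n_states):
            # Spread gammas across (0, 1); each microAgent discounts exponentially,
            # but the population as a whole discounts approximately hyperbolically.
            self.micro = [MicroAgent(g, n_states)
                          for g in np.linspace(0.05, 0.99, n_micro)]

        def step(self, state, next_state, reward):
            # Pool the individual deltas (a simple mean here, an assumption for illustration).
            return np.mean([m.td_update(state, next_state, reward) for m in self.micro])

    # Numerical check that averaging exponential discount curves gives a hyperbolic one:
    # the mean of gamma**t over gammas uniform on (0, 1) approaches 1 / (1 + t).
    gammas = np.linspace(0.001, 0.999, 1000)
    for t in (1, 5, 10, 20):
        print(t, np.mean(gammas ** t), 1.0 / (1.0 + t))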