Daw Nathaniel D, Courville Aaron C, Touretzky David S
UCL, Gatsby Computational Neuroscience Unit, London, WC1N 3AR, UK.
Neural Comput. 2006 Jul;18(7):1637-77. doi: 10.1162/neco.2006.18.7.1637.
Although the responses of dopamine neurons in the primate midbrain are well characterized as carrying a temporal difference (TD) error signal for reward prediction, existing theories do not offer a credible account of how the brain keeps track of past sensory events that may be relevant to predicting future reward. Empirically, these shortcomings of previous theories are particularly evident in their account of experiments in which animals were exposed to variation in the timing of events. The original theories mispredicted the results of such experiments due to their use of a representational device called a tapped delay line. Here we propose that a richer understanding of history representation and a better account of these experiments can be given by considering TD algorithms for a formal setting that incorporates two features not originally considered in theories of the dopaminergic response: partial observability (a distinction between the animal's sensory experience and the true underlying state of the world) and semi-Markov dynamics (an explicit account of variation in the intervals between events). The new theory situates the dopaminergic system in a richer functional and anatomical context, since it assumes (in accord with recent computational theories of cortex) that problems of partial observability and stimulus history are solved in sensory cortex using statistical modeling and inference and that the TD system predicts reward using the results of this inference rather than raw sensory data. It also accounts for a range of experimental data, including the experiments involving programmed temporal variability and other previously unmodeled dopaminergic response phenomena, which we suggest are related to subjective noise in animals' interval timing. Finally, it offers new experimental predictions and a rich theoretical framework for designing future experiments.
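As a rough illustration of the terms used in the abstract, the following is a minimal sketch of TD learning over a tapped delay line, in which each time step since stimulus onset has its own weight. It is not the authors' model or code: the function name `run_trial`, the number of taps, the learning rate, the discount factor, and the reward times are all assumptions chosen for illustration. The TD error at step t is computed as delta_t = r_t + gamma * V(t+1) - V(t), with V(t) read from the tap active at step t.

```python
def run_trial(w, reward_time, alpha=0.1, gamma=1.0, learn=True):
    """Run one trial with the stimulus at step 0; return the TD error at each step.

    w is the list of tapped-delay-line weights: tap t is the only active
    feature t steps after stimulus onset, so the value estimate V(t) is w[t].
    All parameter values are illustrative assumptions.
    """
    n_taps = len(w)
    deltas = []
    for t in range(n_taps):
        reward = 1.0 if t == reward_time else 0.0
        v_now = w[t]
        # The value after the last tap is taken to be zero (end of trial).
        v_next = w[t + 1] if t + 1 < n_taps else 0.0
        delta = reward + gamma * v_next - v_now  # TD error
        if learn:
            w[t] += alpha * delta  # only the active tap is updated
        deltas.append(delta)
    return deltas


weights = [0.0] * 10

# Train with the reward always arriving 5 steps after the stimulus.
for _ in range(500):
    run_trial(weights, reward_time=5)

# Probe trials without learning: reward at the usual time, then 2 steps early.
on_time = run_trial(weights, reward_time=5, learn=False)
early = run_trial(weights, reward_time=3, learn=False)

print("reward on time:", [round(d, 3) for d in on_time])
print("reward early:  ", [round(d, 3) for d in early])
```

After training in this toy setting, the on-time trial produces TD errors close to zero at every step. The early-reward probe produces a positive error at the step where the reward arrives and a negative error at the step where it used to arrive. This is because the delay line ties each prediction to a fixed elapsed time since the stimulus and has no way to register that the reward has already been delivered.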