Peter Dayan
Gatsby Computational Neuroscience Unit, UCL, London, WC1N 3AR, UK.
Network. 2009;20(1):32-46. doi: 10.1080/09548980902759086.
A striking recent finding is that monkeys behave maladaptively in a class of tasks in which they know that reward is going to be systematically delayed. This may be explained by a malign Pavlovian influence arising from states with low predicted values. However, through a careful analysis of behavioral data from such tasks, La Camera and Richmond (2008) observed the additional important characteristic that subjects perform differently in states that are equidistant from the future reward, depending on what has happened in the recent past. The authors pointed out that this violates the definition of state value in the standard reinforcement learning models that are ubiquitous as accounts of operant and classically conditioned behavior; they suggested and analyzed an alternative temporal difference (TD) model in which past and future are melded. Here, we show that a standard TD model can in fact exhibit the same behavior, and that this avoids deleterious consequences for choice. At the heart of the model is the average reward per step, which acts as a baseline against which immediate rewards are measured. Relatively subtle changes to this baseline occasioned by the past can markedly influence predictions and thus behavior.
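To make the role of the average-reward baseline concrete, the following is a minimal sketch of tabular average-reward TD(0) in Python. The environment interface (`env.reset`, `env.step`, `env.n_actions`), the step sizes `alpha` and `eta`, and the random policy are illustrative assumptions, not the paper's implementation; the point is only that the TD error judges each immediate reward against a running estimate of the average reward per step, so that recent history, by shifting this baseline, shifts every prediction.

```python
import numpy as np

def average_reward_td0(env, n_states, n_steps=10_000, alpha=0.1, eta=0.01, seed=0):
    """Sketch of average-reward TD(0); env is a hypothetical discrete-state
    environment with env.reset() -> state and env.step(a) -> (state, reward)."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)   # state values, measured relative to the baseline
    rho = 0.0                # running estimate of the average reward per step
    s = env.reset()
    for _ in range(n_steps):
        a = rng.integers(env.n_actions)   # random policy, for illustration only
        s_next, r = env.step(a)
        # TD error: the immediate reward r is compared against the baseline rho,
        # so even small history-driven shifts in rho move every prediction.
        delta = r - rho + V[s_next] - V[s]
        V[s] += alpha * delta
        rho += eta * delta                # nudge the baseline toward the TD error
        s = s_next
    return V, rho
```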