Bogacz Rafal, McClure Samuel M, Li Jian, Cohen Jonathan D, Montague P Read
Center for the Study of Brain, Mind and Behavior, Princeton University, Princeton, NJ 08544, USA.
Brain Res. 2007 Jun 11;1153:111-21. doi: 10.1016/j.brainres.2007.03.057. Epub 2007 Mar 24.
Recent experimental and theoretical work on reinforcement learning has shed light on the neural bases of learning from rewards and punishments. One fundamental problem in reinforcement learning is the credit assignment problem, or how to properly assign credit to actions that lead to reward or punishment following a delay. Temporal difference learning solves this problem, but its efficiency can be significantly improved by the addition of eligibility traces (ET). In essence, ETs function as decaying memories of previous choices that are used to scale synaptic weight changes. It has been shown in theoretical studies that ETs spanning a number of actions may improve the performance of reinforcement learning. However, it remains an open question whether including ETs that persist over sequences of actions allows reinforcement learning models to better fit empirical data regarding the behaviors of humans and other animals. Here, we report an experiment in which human subjects performed a sequential economic decision game in which the long-term optimal strategy differed from the strategy that leads to the greatest short-term return. We demonstrate that human subjects' performance in the task is significantly affected by the time between choices in a surprising and seemingly counterintuitive way. However, this behavior is naturally explained by a temporal difference learning model which includes ETs persisting across actions. Furthermore, we review recent findings that suggest that short-term synaptic plasticity in dopamine neurons may provide a realistic biophysical mechanism for producing ETs that persist on a timescale consistent with behavioral observations.
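As an illustration of the idea that ETs act as decaying memories of previous choices that scale weight changes, the following is a minimal sketch, not the model fitted in the paper. It assumes a single-state task with two actions, a replacing trace, and a simplified prediction error (reward minus the value of the chosen action, without the bootstrapped next-value term of full temporal difference learning). The names `update_with_traces`, `run_two_trials`, `alpha`, and `trace_decay` are hypothetical and chosen only for this example.

```python
import numpy as np


def update_with_traces(values, traces, action, reward, alpha=0.1, trace_decay=0.8):
    """One learning step for action values with decaying eligibility traces.

    values, traces: 1-D float arrays with one entry per action.
    Returns new (values, traces) arrays; the inputs are not modified.
    """
    # Simplified prediction error (assumption): reward minus chosen action's value.
    delta = reward - values[action]
    # Decay the memory of all earlier choices (this creates a new array) ...
    traces = trace_decay * traces
    # ... and mark the action just taken as fully eligible (replacing trace).
    traces[action] = 1.0
    # Each action's value changes in proportion to its current trace.
    values = values + alpha * delta * traces
    return values, traces


def run_two_trials(trace_decay):
    """Choose action 0 (no reward), then action 1 (reward 1); return final values."""
    values = np.zeros(2)
    traces = np.zeros(2)
    for action, reward in [(0, 0.0), (1, 1.0)]:
        values, traces = update_with_traces(
            values, traces, action, reward, alpha=0.1, trace_decay=trace_decay
        )
    return values


if __name__ == "__main__":
    # With a persisting trace, the earlier choice of action 0 shares the credit.
    print(run_two_trials(0.8))
    # With no trace across actions, only the action just taken is updated.
    print(run_two_trials(0.0))
```

In this sketch, setting `trace_decay` to zero confines each update to the most recent choice, whereas a positive value lets a reward also change the value of the choice made before it, which is the sense in which a trace persists across actions.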