Istituto di Scienze e Tecnologie della Cognizione (ISTC), CNR, Via San Martino della Battaglia 44, 00185, Roma, Italy.
Neural Netw. 2013 Mar;39:40-51. doi: 10.1016/j.neunet.2012.12.012. Epub 2013 Jan 14.
An important issue in recent neuroscientific research is understanding the functional role of the phasic release of dopamine in the striatum, and in particular its relation to reinforcement learning. The literature is split between two alternative hypotheses: one considers phasic dopamine to be a reward prediction error similar to the computational TD-error, whose function is to guide an animal to maximize future rewards; the other holds that phasic dopamine is a sensory prediction error signal that lets the animal discover and acquire novel actions. In this paper we propose an original hypothesis that integrates these two contrasting positions: in our view, phasic dopamine represents a TD-like reinforcement prediction error learning signal determined by both unexpected changes in the environment (temporary, intrinsic reinforcements) and biological rewards (permanent, extrinsic reinforcements). Accordingly, dopamine plays the functional role of driving both the discovery and acquisition of novel actions and the maximization of future rewards. To validate our hypothesis we perform a series of experiments with a simulated robotic system that has to learn different skills in order to obtain rewards. We compare different versions of the system in which we vary the composition of the learning signal. The results show that only the system reinforced by both extrinsic and intrinsic reinforcements is able to reach high performance in sufficiently complex conditions.
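The idea of a single TD-like signal fed by both kinds of reinforcement can be illustrated with a minimal sketch. This is not the model used in the paper, whose details the abstract does not give. The names `td_error` and `EventPredictor`, the discount factor, the learning rate, and the rule that the intrinsic reinforcement equals the remaining surprise about an environmental change are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


def td_error(r_extrinsic, r_intrinsic, v_current, v_next, gamma=0.9):
    """TD-like prediction error driven by the sum of both reinforcements.

    gamma is an assumed discount factor, not a value from the paper.
    """
    return (r_extrinsic + r_intrinsic) + gamma * v_next - v_current


@dataclass
class EventPredictor:
    """Hypothetical predictor of an environmental change (e.g. a light turning on)."""

    prediction: float = 0.0     # current expectation that the change occurs
    learning_rate: float = 0.5  # assumed value, for illustration only

    def intrinsic_reinforcement(self, event_occurred: bool) -> float:
        """Return the surprise for this event, then update the expectation."""
        outcome = 1.0 if event_occurred else 0.0
        surprise = max(0.0, outcome - self.prediction)
        self.prediction += self.learning_rate * (outcome - self.prediction)
        return surprise


# The same change is caused five times in a row: the intrinsic
# reinforcement is temporary and fades as the change becomes predictable.
predictor = EventPredictor()
intrinsic_trace = [predictor.intrinsic_reinforcement(True) for _ in range(5)]

# An extrinsic (biological) reward does not fade in this sketch, so it
# keeps contributing to the learning signal on every occurrence.
first_error = td_error(r_extrinsic=0.0, r_intrinsic=intrinsic_trace[0],
                       v_current=0.0, v_next=0.0)
```

Under these assumptions `intrinsic_trace` halves on each repetition, which mirrors the abstract's distinction between temporary intrinsic reinforcements and permanent extrinsic ones.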