DeepMind, London, UK.
Max Planck UCL Centre for Computational Psychiatry and Ageing Research, University College London, London, UK.
Nature. 2020 Jan;577(7792):671-675. doi: 10.1038/s41586-019-1924-6. Epub 2020 Jan 15.
Since its introduction, the reward prediction error theory of dopamine has explained a wealth of empirical phenomena, providing a unifying framework for understanding the representation of reward and value in the brain. According to the now-canonical theory, reward predictions are represented as a single scalar quantity, which supports learning about the expectation, or mean, of stochastic outcomes. Here we propose an account of dopamine-based reinforcement learning inspired by recent artificial intelligence research on distributional reinforcement learning. We hypothesized that the brain represents possible future rewards not as a single mean, but instead as a probability distribution, effectively representing multiple future outcomes simultaneously and in parallel. This idea implies a set of empirical predictions, which we tested using single-unit recordings from the mouse ventral tegmental area. Our findings provide strong evidence for a neural realization of distributional reinforcement learning.
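To make the distributional hypothesis concrete, the following is a minimal Python sketch of distributional temporal-difference learning through asymmetrically scaled prediction errors, the kind of mechanism the abstract describes. It is not the authors' implementation: the bimodal reward distribution, the asymmetry parameters in taus, the number of predictors, and the learning rate are all illustrative assumptions chosen for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_reward():
        # Illustrative bimodal outcome: the mean (3.7) is a value that never
        # actually occurs, so a single scalar prediction loses most of the
        # structure of the distribution.
        return rng.choice([1.0, 10.0], p=[0.7, 0.3])

    # A population of value predictors, one per hypothetical "channel". Channel i
    # scales positive prediction errors by taus[i] and negative errors by
    # 1 - taus[i]; the fixed point of this rule is the taus[i]-th expectile of
    # the reward distribution, so the population as a whole encodes the
    # distribution rather than only its mean.
    taus = np.linspace(0.1, 0.9, 9)   # assumed asymmetry parameters
    values = np.zeros_like(taus)      # learned predictions, one per channel
    base_lr = 0.01

    for _ in range(50_000):
        r = sample_reward()
        delta = r - values            # per-channel prediction errors
        lr = np.where(delta > 0, taus, 1.0 - taus) * base_lr
        values += lr * delta          # asymmetric update

    print("learned predictions:", np.round(values, 2))  # spans low to high outcomes
    print("classical mean prediction:", 0.7 * 1.0 + 0.3 * 10.0)

Setting every entry of taus to 0.5 makes the update symmetric and all channels converge to the mean, recovering the classical single-scalar reward prediction error rule as a special case of this scheme.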