Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, United Kingdom.
Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom.
PLoS Comput Biol. 2022 May 27;18(5):e1009816. doi: 10.1371/journal.pcbi.1009816. eCollection 2022 May.
To accurately predict rewards associated with states or actions, the variability of observations has to be taken into account. In particular, when observations are noisy, individual rewards should have less influence on the tracking of the average reward, and the estimate of the mean reward should be updated to a smaller extent after each observation. However, it is not known how the magnitude of the observation noise might be tracked and used to control prediction updates in the brain's reward system. Here, we introduce a new model that uses simple, tractable learning rules to track the mean and standard deviation of reward, and leverages prediction errors scaled by uncertainty as the central feedback signal. We show that the new model has an advantage over conventional reinforcement learning models in a value-tracking task and approaches the theoretical limit of performance provided by the Kalman filter. Further, we propose a possible biological implementation of the model in the basal ganglia circuit. In the proposed network, dopaminergic neurons encode reward prediction errors scaled by the standard deviation of rewards. We show that such scaling may arise if striatal neurons learn the standard deviation of rewards and modulate the activity of dopaminergic neurons. The model is consistent with experimental findings on the scaling of dopamine prediction errors relative to reward magnitude, and with many features of striatal plasticity. Our results span the levels of implementation, algorithm, and computation, and may have important implications for understanding the dopaminergic prediction error signal and its relation to adaptive and effective learning.
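To illustrate the kind of learning rules the abstract describes, the sketch below shows a minimal delta-rule tracker of the mean and spread of rewards, with the value update scaled by the current uncertainty estimate so that noisier rewards move the estimate less. The specific update equations, learning rate, and initial values are hypothetical choices for illustration, not the rules used in the paper.

```python
import numpy as np

def track_reward_statistics(rewards, alpha=0.1):
    """Illustrative sketch (not the paper's exact model): track a running
    estimate of the mean reward and of the reward spread with delta rules,
    using a prediction error scaled by the uncertainty estimate."""
    v = 0.0        # running estimate of the mean reward
    sigma = 1.0    # running estimate of the reward spread (kept positive)
    history = []
    for r in rewards:
        delta = r - v                          # reward prediction error
        scaled_delta = delta / sigma           # prediction error scaled by uncertainty
        v += alpha * scaled_delta              # noisy rewards (large sigma) update v less
        sigma += alpha * (abs(delta) - sigma)  # delta rule tracking the typical error size
        sigma = max(sigma, 1e-6)               # avoid division by zero
        history.append((v, sigma))
    return np.array(history)

# Usage example: rewards around a fixed mean with observation noise
rng = np.random.default_rng(0)
rewards = rng.normal(loc=5.0, scale=2.0, size=200)
estimates = track_reward_statistics(rewards)
print(estimates[-1])  # final (mean, spread) estimates
```

Note that tracking the spread via the absolute prediction error yields an estimate proportional to, rather than equal to, the true standard deviation for Gaussian rewards; the sketch only illustrates how uncertainty scaling damps updates under noisy observations.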