Millidge Beren, Song Yuhang, Lak Armin, Walton Mark E, Bogacz Rafal
MRC Brain Network Dynamics Unit, University of Oxford, Oxford, United Kingdom.
Department of Physiology, Anatomy & Genetics, University of Oxford, Oxford, United Kingdom.
PLoS Comput Biol. 2024 Nov 19;20(11):e1012580. doi: 10.1371/journal.pcbi.1012580. eCollection 2024 Nov.
Animals can adapt their preferences for different types of reward according to physiological state, such as hunger or thirst. To explain this ability, we employ a simple multi-objective reinforcement learning model that learns multiple values according to different reward dimensions such as food or water. We show that by weighting these learned values according to current needs, behaviour can be flexibly adapted to present preferences. This model predicts that individual dopamine neurons should encode the errors associated with some reward dimensions more than with others. To provide a preliminary test of this prediction, we reanalysed a small dataset obtained from a single primate in an experiment which, to our knowledge, is the only published study in which the responses of dopamine neurons to stimuli predicting distinct types of rewards were recorded. We observed that in addition to subjective economic value, dopamine neurons encode a gradient of reward dimensions; some neurons respond most strongly to stimuli predicting food rewards, while others respond more to stimuli predicting fluids. We also proposed a possible implementation of the model in the basal ganglia network, and demonstrated how the striatal system can learn values in multiple dimensions, even when dopamine neurons encode mixtures of prediction errors from different dimensions. Additionally, the model reproduces the instant generalisation to new physiological states seen in dopamine responses and in behaviour. Our results demonstrate how a simple neural circuit can flexibly guide behaviour according to animals' needs.
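To illustrate the core idea described in the abstract, the following is a minimal sketch (not the authors' implementation) of multi-objective temporal-difference learning: each stimulus has a separate learned value per reward dimension, and the values are weighted by current physiological needs at decision time. The dimension names, learning rate, toy reward schedule, and linear need-weighting rule are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch of need-weighted multi-objective value learning.
# All parameters and the reward schedule below are illustrative, not from the paper.

rng = np.random.default_rng(0)

n_stimuli = 4                          # cues predicting different rewards
dims = ["food", "water"]               # reward dimensions
V = np.zeros((n_stimuli, len(dims)))   # per-dimension value of each stimulus
alpha = 0.1                            # learning rate

def update(stimulus, reward_vec):
    """One trial: per-dimension prediction errors update per-dimension values."""
    delta = reward_vec - V[stimulus]   # vector of prediction errors
    V[stimulus] += alpha * delta
    return delta

def preference(needs):
    """Weight learned values by current needs (e.g. hunger, thirst)."""
    return V @ np.asarray(needs)       # scalar subjective value per stimulus

# Toy training: stimuli 0-1 predict food, stimuli 2-3 predict water.
rewards = np.array([[1.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.5]])
for _ in range(500):
    s = rng.integers(n_stimuli)
    update(s, rewards[s])

# A new physiological state changes preferences instantly, without relearning.
print("hungry :", preference([1.0, 0.1]))   # food-predicting cues valued highest
print("thirsty:", preference([0.1, 1.0]))   # water-predicting cues valued highest
```

In this sketch, the per-dimension prediction errors play the role the abstract ascribes to individual dopamine neurons preferentially encoding errors for particular reward dimensions, and the need-weighted readout shows how behaviour can generalise to a new physiological state without further learning.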