Department of Physiology, Development and Neuroscience, University of Cambridge, Cambridge CB2 3DY, United Kingdom.
Proc Natl Acad Sci U S A. 2024 May 14;121(20):e2316658121. doi: 10.1073/pnas.2316658121. Epub 2024 May 8.
Individual survival and evolutionary selection require biological organisms to maximize reward. Economic choice theories define the necessary and sufficient conditions, and neuronal signals of decision variables provide mechanistic explanations. Reinforcement learning (RL) formalisms use predictions, actions, and policies to maximize reward. Midbrain dopamine neurons code reward prediction errors (RPE) of subjective reward value suitable for RL. Electrical and optogenetic self-stimulation experiments demonstrate that monkeys and rodents repeat behaviors that result in dopamine excitation. Dopamine excitations reflect positive RPEs that increase reward predictions via RL; once predictions have risen, obtaining similar dopamine RPE signals again requires better rewards than before. The positive RPEs drive predictions higher again and thus advance a recursive reward-RPE-prediction iteration toward better and better rewards. Agents also avoid dopamine inhibitions, which lower reward predictions via RL; the lowered predictions allow smaller rewards than before to elicit positive dopamine RPE signals and resume the iteration toward better rewards. In this way, dopamine RPE signals serve a causal mechanism that attracts agents via RL to the best rewards. The mechanism improves daily life and benefits evolutionary selection but may also induce restlessness and greed.
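The recursive reward-RPE-prediction iteration described in the abstract can be illustrated with a minimal sketch. It assumes a simple delta-rule (Rescorla-Wagner-style) update, in which the RPE is the received reward minus the predicted reward and the prediction moves toward the reward by a fraction of the RPE. The abstract does not specify a learning rule, so this choice, the learning rate of 0.5, and all function names are illustrative assumptions rather than the author's model.

```python
# Illustrative sketch only: a delta-rule update is ASSUMED here; the
# abstract does not state which RL update rule is intended.

def rpe(reward, prediction):
    """Reward prediction error: received reward minus predicted reward."""
    return reward - prediction


def update_prediction(prediction, reward, learning_rate=0.5):
    """Shift the prediction toward the reward in proportion to the RPE."""
    return prediction + learning_rate * rpe(reward, prediction)


def reward_needed(prediction, target_rpe):
    """Reward required to elicit `target_rpe` against the current prediction."""
    return prediction + target_rpe


def escalate(prediction, target_rpe, steps, learning_rate=0.5):
    """Repeatedly obtain the same positive RPE and learn from each reward.

    Returns the sequence of rewards that were needed and the final prediction.
    """
    rewards = []
    for _ in range(steps):
        reward = reward_needed(prediction, target_rpe)
        rewards.append(reward)
        prediction = update_prediction(prediction, reward, learning_rate)
    return rewards, prediction


# Obtaining the same RPE of 1.0 five times requires ever larger rewards.
rewards, final_prediction = escalate(prediction=0.0, target_rpe=1.0, steps=5)
print(rewards)           # [1.0, 1.5, 2.0, 2.5, 3.0]
print(final_prediction)  # 2.5

# A negative RPE (reward 0.0 against prediction 2.0) lowers the prediction,
# so a reward of 1.5 that was previously disappointing now yields a positive RPE.
lowered = update_prediction(2.0, 0.0)
print(rpe(1.5, 2.0), rpe(1.5, lowered))  # -0.5 0.5
```

Under these assumptions, each positive RPE raises the prediction, so the reward needed for the next equally large RPE grows step by step, whereas a lowered prediction lets a smaller reward restart the climb.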