Department of Behavioral Neuroscience, Oregon Health and Science University, Portland, Oregon 97239-3098, and
Laboratory of Neuropsychology, National Institute of Mental Health, National Institutes of Health, Bethesda, Maryland 20892-4415.
J Neurosci. 2020 Mar 18;40(12):2553-2561. doi: 10.1523/JNEUROSCI.2355-19.2020. Epub 2020 Feb 14.
Reinforcement learning (RL) refers to the behavioral process of learning to obtain reward and avoid punishment. An important component of RL is managing the explore-exploit tradeoff: the problem of choosing between exploiting options with known values and exploring unfamiliar options. We examined correlates of this tradeoff, as well as other RL-related variables, in orbitofrontal cortex (OFC) while three male monkeys performed a three-armed bandit learning task. During the task, novel choice options periodically replaced familiar options. The values of the novel options were unknown, and the monkeys had to explore them to determine whether they were better than the other currently available options. The identity of the chosen stimulus and the reward outcome were strongly encoded in the responses of single OFC neurons. These two variables define the states and state transitions in our model that are relevant to decision-making. The chosen value of the option and the relative value of exploring that option were encoded at intermediate levels. We also found that OFC value coding was stimulus specific, rather than independent of the identity of the option. The location of the option and the value of the current environment were encoded at low levels. We therefore found encoding in OFC of the variables relevant to learning and to managing explore-exploit tradeoffs. These results are consistent with findings in the ventral striatum and amygdala and show that this monosynaptically connected network plays an important role in learning based on the immediate and future consequences of choices.

SIGNIFICANCE STATEMENT Orbitofrontal cortex (OFC) has been implicated in representing the expected values of choices. Here we extend these results and show that OFC also encodes information relevant to managing explore-exploit tradeoffs. Specifically, OFC encodes an exploration bonus, which characterizes the relative value of exploring novel choice options. OFC also strongly encodes the identity of the chosen stimulus and reward outcomes, which are necessary for computing the values of novel and familiar options.
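To make the exploration-bonus idea concrete, the sketch below implements a toy value learner for a three-armed bandit in which each option's effective value is its learned value plus a novelty bonus that decays as the option is sampled, and resets when a novel stimulus replaces a familiar one. This is a minimal illustration of the general technique only; the class name BanditAgent and all parameter values (alpha, bonus, decay) are hypothetical and are not taken from the paper's fitted model.

```python
import random

class BanditAgent:
    """Toy value learner for a three-armed bandit with an exploration bonus.

    Illustrative only: the learning rate, bonus size, and decay schedule
    are hypothetical, not the authors' fitted model.
    """

    def __init__(self, n_arms=3, alpha=0.2, bonus=0.5, decay=0.8):
        self.values = [0.0] * n_arms   # estimated value per option
        self.bonus = [bonus] * n_arms  # exploration bonus for novel options
        self.alpha = alpha             # learning rate
        self.decay = decay             # bonus shrinks with each sample

    def choose(self):
        # Effective value = learned value + exploration bonus;
        # the bonus raises the relative value of unfamiliar options.
        effective = [v + b for v, b in zip(self.values, self.bonus)]
        best = max(effective)
        return random.choice([i for i, e in enumerate(effective) if e == best])

    def update(self, arm, reward):
        # Standard delta-rule update toward the observed outcome.
        self.values[arm] += self.alpha * (reward - self.values[arm])
        # Sampling an option reduces its novelty, shrinking its bonus.
        self.bonus[arm] *= self.decay

    def replace_option(self, arm, bonus=0.5):
        # A novel stimulus replaces a familiar one: its value is unknown,
        # so the estimate resets and the full exploration bonus returns.
        self.values[arm] = 0.0
        self.bonus[arm] = bonus


# Example: learn from a block of trials, then introduce a novel option.
agent = BanditAgent()
true_reward_prob = [0.2, 0.7, 0.5]  # hypothetical reward probabilities
for _ in range(100):
    arm = agent.choose()
    reward = 1.0 if random.random() < true_reward_prob[arm] else 0.0
    agent.update(arm, reward)
agent.replace_option(0)  # arm 0 becomes a novel stimulus worth exploring
```

Under these assumptions, the agent initially favors the novel option even though its learned value is zero, because the bonus makes its effective value competitive with familiar options; the bonus then decays as evidence about the option's true value accumulates.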