Zsuga Judit, Biro Klara, Tajti Gabor, Szilasi Magdolna Emma, Papp Csaba, Juhasz Bela, Gesztelyi Rudolf
Department of Health Systems Management and Quality Management for Health Care, Faculty of Public Health, University of Debrecen, Debrecen, Nagyerdei krt. 98, 4032, Hungary.
Department of Pharmacology, Faculty of Pharmacy, University of Debrecen, Debrecen, Nagyerdei krt. 98, 4032, Hungary.
BMC Neurosci. 2016 Oct 28;17(1):70. doi: 10.1186/s12868-016-0302-7.
Reinforcement learning is a fundamental form of learning that may be formalized using the Bellman equation. Accordingly, an agent determines the value of a state as the sum of the immediate reward and the discounted value of future states. Thus the value of a state is determined by agent-related attributes (action set, policy, discount factor) and by the agent's knowledge of the environment, embodied in the reward function and in hidden environmental factors given by the transition probability. The central objective of reinforcement learning is to solve these two functions, which lie outside the agent's control, either with or without a model.
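In its standard textbook form (not quoted from the paper itself), the Bellman equation described above expresses the value of a state s under a policy π as the immediate reward plus the discounted, transition-weighted value of successor states:

V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{\pi}(s') \right]

Here π(a|s) is the policy, P(s'|s,a) the transition probability, R(s,a,s') the reward function, and γ the discount factor; the reward and transition functions are the two environment-dependent quantities that reinforcement learning must estimate, with or without an explicit model.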
In the present paper, using the proactive model of reinforcement learning, we offer insight into how the brain creates simplified representations of the environment and how these representations are organized to support the identification of relevant stimuli and actions. Furthermore, we identify neurobiological correlates of our model by suggesting that the reward and policy functions, attributes of the Bellman equation, are built by the orbitofrontal cortex (OFC) and the anterior cingulate cortex (ACC), respectively.
Based on this, we propose that the OFC assesses cue-context congruence to activate the most relevant context frame. Furthermore, given the bidirectional neuroanatomical link between the OFC and model-free structures, we suggest that model-based input is incorporated into the reward prediction error (RPE) signal and, conversely, that the RPE signal may be used to update the reward-related information of context frames and the policy underlying action selection in the OFC and ACC, respectively. Finally, clinical implications for cognitive behavioral interventions are discussed.
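For reference, the reward prediction error invoked here is conventionally written as the temporal-difference error (a standard formulation, not the paper's own notation):

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t), \qquad V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t

where α is a learning rate. In the proposed scheme, such an RPE signal would carry model-based contributions from the OFC and, in turn, serve to update the reward-related content of context frames in the OFC and the action-selection policy in the ACC.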