Brain & Mind Research Institute, University of Sydney, Camperdown, NSW 2050, Australia.
Eur J Neurosci. 2012 Apr;35(7):1036-51. doi: 10.1111/j.1460-9568.2012.08050.x.
It is now widely accepted that instrumental actions can be either goal-directed or habitual: whereas the former are rapidly acquired and regulated by their outcome, the latter are reflexive, elicited by antecedent stimuli rather than their consequences. Model-based reinforcement learning (RL) provides an elegant description of goal-directed action. Through exposure to states, actions and rewards, the agent rapidly constructs a model of the world and can choose an appropriate action based on quite abstract changes in environmental and evaluative demands. This model is powerful but has a problem explaining the development of habitual actions. To account for habits, theorists have argued that a second action controller is required, called model-free RL, which does not form a model of the world but instead caches action values within states, allowing a state to select an action based on that action's reward history rather than its consequences. Nevertheless, there are persistent problems with important predictions of model-free RL; most notably, it fails to correctly predict the insensitivity of habitual actions to changes in the action-reward contingency. Here, we suggest that introducing model-free RL into accounts of instrumental conditioning is unnecessary, and demonstrate that reconceptualizing habits as action sequences allows model-based RL to be applied to both goal-directed and habitual actions in a manner consistent with what real animals do. This approach has significant implications for the way habits are currently investigated and generates new experimental predictions.
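As a concrete illustration of the distinction the abstract draws (our sketch, not taken from the paper), the fragment below implements a model-free controller that caches action values and a model-based controller that evaluates actions by finite-depth lookahead through a learned world model. The class names, the one-state lever-pressing task, and the devaluation step are illustrative assumptions; the model-free update is a standard Q-learning rule.

```python
from collections import defaultdict

# Model-free controller: caches one value per (state, action) pair and
# updates it from reward history alone (a standard Q-learning rule).
class ModelFreeAgent:
    def __init__(self, actions, alpha=0.5, gamma=0.9):
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        self.q = defaultdict(float)  # cached action values

    def choose(self, state):
        # The state selects the action with the best cached reward
        # history, blind to that action's current consequences.
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error

# Model-based controller: learns a transition/reward model and evaluates
# each action at choice time by looking ahead through that model.
class ModelBasedAgent:
    def __init__(self, actions, gamma=0.9):
        self.actions = actions
        self.gamma = gamma
        self.transition = {}              # (state, action) -> next state
        self.reward = defaultdict(float)  # (state, action) -> reward

    def observe(self, state, action, reward, next_state):
        self.transition[(state, action)] = next_state
        self.reward[(state, action)] = reward

    def action_value(self, state, action, depth):
        future = 0.0
        if depth > 1:
            nxt = self.transition.get((state, action), state)
            future = max(self.action_value(nxt, a, depth - 1)
                         for a in self.actions)
        return self.reward[(state, action)] + self.gamma * future

    def choose(self, state, depth=3):
        # Consequences are re-derived on every choice, so a change in the
        # reward model (e.g. outcome devaluation) alters behavior at once.
        return max(self.actions,
                   key=lambda a: self.action_value(state, a, depth))

# Toy outcome-devaluation test: both agents learn that 'press' pays off in
# state 's'; the outcome is then devalued without any further training.
mf = ModelFreeAgent(['press', 'idle'])
mb = ModelBasedAgent(['press', 'idle'])
for _ in range(50):
    mf.update('s', 'press', 1.0, 's')
    mb.observe('s', 'press', 1.0, 's')
# Devaluation updates the world model; the model-free cache has no outcome
# representation to update, so it is untouched until retraining occurs.
mb.reward[('s', 'press')] = -1.0
print(mf.choose('s'))  # 'press' -- the cached value persists (habit-like)
print(mb.choose('s'))  # 'idle'  -- lookahead respects the devalued outcome
```

Under these assumptions, the model-based agent abandons the devalued action immediately, while the model-free agent perseveres until its cached values are retrained; this is the behavioral signature used experimentally to separate goal-directed from habitual control.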