Sohrab Saeb, Cornelius Weber, Jochen Triesch
Frankfurt Institute for Advanced Studies, Goethe University, Frankfurt am Main, Germany.
Neural Netw. 2009 Jul-Aug;22(5-6):586-92. doi: 10.1016/j.neunet.2009.06.049. Epub 2009 Jul 8.
The brain is able to perform actions based on an adequate internal representation of the world, in which task-irrelevant features are ignored and incomplete sensory data are estimated. Traditionally, it is assumed that such abstract state representations are obtained purely from the statistics of sensory input, for example by unsupervised learning methods. However, more recent findings suggest an influence of the dopaminergic system, which can be modeled by a reinforcement learning approach. Standard reinforcement learning algorithms act on a single-layer network connecting the state space to the action space. Here, we introduce a feature detection stage and a memory layer, which together construct the state space for a learning agent. The memory layer consists of the state activation at the previous time step as well as the previously chosen action. We present a temporal-difference-based learning rule for training the weights from these additional inputs to the state layer. As a result, the performance of the network is maintained both in the presence of task-irrelevant features and at randomly occurring time steps during which the input is invisible. Interestingly, a goal-directed forward model emerges from the memory weights, covering only the state-action pairs that are relevant to the task. The model presents a link between reinforcement learning, feature detection, and forward models and may help to explain how reward systems recruit cortical circuits for goal-directed feature detection and prediction.
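To make the described architecture concrete, here is a minimal Python sketch of the general idea: a state layer driven jointly by a feature pathway and a memory pathway (previous state activation plus previously chosen action), with a single temporal-difference error training both the value estimate and the memory weights. All layer sizes, learning rates, and the exact update forms are illustrative assumptions; the abstract does not give the paper's equations, so this is a sketch of the technique, not the authors' implementation.

```python
import numpy as np

# Illustrative sizes only; the paper's actual dimensions are not stated in the abstract.
N_IN, N_STATE, N_ACT = 12, 4, 4
rng = np.random.default_rng(0)

W_feat = rng.normal(scale=0.1, size=(N_STATE, N_IN))             # input -> state (feature detection)
W_mem = rng.normal(scale=0.1, size=(N_STATE, N_STATE + N_ACT))   # previous state + action -> state (memory)
w_val = np.zeros(N_STATE)                                        # state -> value (critic)
gamma, alpha = 0.9, 0.05                                         # assumed discount factor and learning rate

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def state_activation(inp, prev_state, prev_act):
    """Combine the feature pathway with the memory pathway. With an
    invisible input (all zeros), the memory term alone carries the
    state estimate, as the abstract describes."""
    mem = np.concatenate([prev_state, np.eye(N_ACT)[prev_act]])
    return softmax(W_feat @ inp + W_mem @ mem)

def td_update(inp, prev_state, prev_act, reward, next_value):
    """One TD-style step: the TD error trains the critic and, crucially,
    also gates learning of the memory weights, so the emerging forward
    model covers only task-relevant state-action pairs."""
    s = state_activation(inp, prev_state, prev_act)
    delta = reward + gamma * next_value - w_val @ s   # TD error
    w_val += alpha * delta * s                        # critic update
    mem = np.concatenate([prev_state, np.eye(N_ACT)[prev_act]])
    W_mem += alpha * delta * np.outer(s, mem)         # memory-weight update
    return s, delta

# Usage: one visible step, then one step with the input blanked out.
s_prev = np.full(N_STATE, 1.0 / N_STATE)
s, delta = td_update(rng.random(N_IN), s_prev, prev_act=0, reward=0.0, next_value=0.0)
s_blind, _ = td_update(np.zeros(N_IN), s, prev_act=1, reward=1.0, next_value=0.0)
```

In this sketch, blanking the input leaves the state estimate to the memory pathway, which is the mechanism by which performance could be maintained at time steps where the input is invisible.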