Stringer S M, Rolls E T, Taylor P
Oxford University, Centre for Computational Neuroscience, Department of Experimental Psychology, South Parks Road, Oxford OX1 3UD, United Kingdom.
Neural Netw. 2007 Mar;20(2):172-81. doi: 10.1016/j.neunet.2006.01.016. Epub 2006 May 15.
A key problem in reinforcement learning is how an animal is able to learn a sequence of movements when the reward signal only occurs at the end of the sequence. We describe how a hierarchical dynamical model of motor function is able to solve the problem of delayed reward in learning movement sequences using associative (Hebbian) learning. At the lowest level, the motor system encodes simple movements or primitives, while at higher levels the system encodes sequences of primitives. During training, the network is able to learn a high level motor program composed of a specific temporal sequence of motor primitives. The network is able to achieve this despite the fact that the reward signal, which indicates whether or not the desired motor program has been performed correctly, is received only at the end of each trial during learning. Use of a continuous attractor network in the architecture enables the network to generate the motor outputs required to produce the continuous movements necessary to implement the motor sequence.
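The delayed-reward mechanism described above can be illustrated with a toy sketch (this is an illustrative assumption, not the paper's actual architecture): purely Hebbian co-activity between successive primitives is accumulated into an eligibility trace during the trial, and the weight update is applied only when the reward arrives at the end of the trial. The primitive count, target sequence, and learning rate below are hypothetical.

```python
import numpy as np

n_primitives = 4          # number of low-level motor primitives (assumed)
seq = [0, 2, 1, 3]        # target temporal sequence of primitives (hypothetical)

# Asymmetric weight matrix W[i, j]: learned strength of transition j -> i
W = np.zeros((n_primitives, n_primitives))

def train(W, target, lr=0.5, n_trials=20):
    for _ in range(n_trials):
        trace = np.zeros_like(W)
        # During the trial: only accumulate a Hebbian eligibility trace,
        # no weight change yet (the reward has not arrived)
        for t in range(len(target) - 1):
            pre = np.eye(n_primitives)[target[t]]
            post = np.eye(n_primitives)[target[t + 1]]
            trace += np.outer(post, pre)      # Hebbian co-activity
        reward = 1.0                          # delivered only at trial end
        W += lr * reward * trace              # reward-gated associative update
    return W

W = train(W, seq)

# Recall: starting from the first primitive, follow the strongest transition
state, recalled = seq[0], [seq[0]]
for _ in range(len(seq) - 1):
    state = int(np.argmax(W[:, state]))
    recalled.append(state)
print(recalled)  # -> [0, 2, 1, 3]: the trained weights replay the sequence
```

Because the trace is a pure product of pre- and post-synaptic activity, the learning rule remains associative; the terminal reward acts only as a gate on when the accumulated trace is written into the weights.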
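The continuous attractor component mentioned in the abstract can likewise be sketched as a ring network with local excitation and broad inhibition, in which a bump of activity persists at the cued location after the input is removed; the kernel widths and the gain normalization below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

N = 100                                    # neurons arranged on a ring
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)

# Circular distance between every pair of neurons
d = np.abs(theta[:, None] - theta[None, :])
d = np.minimum(d, 2 * np.pi - d)

# Local Gaussian excitation minus uniform inhibition (assumed kernel)
W = np.exp(-d**2 / (2 * 0.3**2)) - 0.15

# Transient cue: a bump of activity centred at pi
r = np.exp(-((theta - np.pi)**2) / (2 * 0.2**2))

for _ in range(200):                       # relax after the cue is removed
    r = np.maximum(W @ r / N, 0.0)         # rectified recurrent dynamics
    if r.max() > 0:
        r = r / r.max()                    # simple gain normalization

peak = theta[np.argmax(r)]                 # bump remains near the cued angle
```

A bump that is self-sustained in this way can be pushed smoothly around the ring by small asymmetric inputs, which is the sense in which a continuous attractor can generate the continuous motor outputs a movement sequence requires.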