Bart Baddeley
Centre for Computational Neuroscience and Robotics, Department of Informatics, University of Sussex, Brighton, UK.
IEEE Trans Syst Man Cybern B Cybern. 2008 Aug;38(4):950-6. doi: 10.1109/TSMCB.2008.921000.
Many interesting problems in reinforcement learning (RL) are continuous and/or high dimensional, and in such cases RL techniques require the use of function approximators for learning value functions and policies. Often, local linear models have been preferred over distributed nonlinear models for function approximation in RL. We suggest that one reason for the difficulties encountered when using distributed architectures in RL is the problem of negative interference, whereby learning of new data disrupts previously learned mappings. The continuous temporal difference (TD) learning algorithm TD(lambda) was used to learn a value function in a limited-torque pendulum swing-up task using a multilayer perceptron (MLP) network. Three different approaches were examined for learning in the MLP networks: 1) simple gradient descent; 2) vario-eta; and 3) a pseudopattern rehearsal strategy that attempts to reduce the effects of interference. Our results show that MLP networks can be used for value function approximation in this task but require long training times. We also found that vario-eta destabilized learning and resulted in a failure of the learning process to converge. Finally, we showed that the pseudopattern rehearsal strategy drastically improved the speed of learning. The results indicate that interference is a greater problem than ill-conditioning for this task.
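To make the described setup more concrete, the following is a minimal sketch of TD(lambda) value-function learning with a one-hidden-layer MLP on a limited-torque pendulum swing-up task, combined with a simple pseudopattern rehearsal step. It is not the paper's exact implementation: the pendulum constants, network size, learning rates, exploration policy, and rehearsal schedule are illustrative assumptions.

```python
# Minimal sketch (assumed parameters, not the paper's settings):
# TD(lambda) with an MLP value function on pendulum swing-up,
# plus pseudopattern rehearsal to counter negative interference.
import numpy as np

rng = np.random.default_rng(0)

# --- limited-torque pendulum dynamics (assumed constants) --------------------
M, L, G, MU, DT, UMAX = 1.0, 1.0, 9.8, 0.01, 0.02, 5.0

def step(theta, omega, u):
    """One Euler step; reward is the height of the pendulum tip."""
    u = np.clip(u, -UMAX, UMAX)
    domega = (-MU * omega + M * G * L * np.sin(theta) + u) / (M * L ** 2)
    omega = np.clip(omega + DT * domega, -8.0, 8.0)
    theta = theta + DT * omega
    return theta, omega, np.cos(theta)      # reward = +1 when upright

def features(theta, omega):
    return np.array([np.cos(theta), np.sin(theta), omega / 8.0])

# --- one-hidden-layer MLP value function -------------------------------------
H = 20
W1 = rng.normal(0, 0.3, (H, 3)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.3, H);      b2 = 0.0

def value_and_grads(x):
    h = np.tanh(W1 @ x + b1)
    v = W2 @ h + b2
    dh = (1.0 - h ** 2) * W2                 # backprop through tanh
    return v, (np.outer(dh, x), dh, h, 1.0)  # grads w.r.t. W1, b1, W2, b2

# --- TD(lambda) with eligibility traces + pseudopattern rehearsal ------------
GAMMA, LAM, ALPHA, REHEARSE_EVERY, N_PSEUDO = 0.98, 0.7, 0.01, 200, 20

theta, omega = np.pi, 0.0                    # start hanging straight down
traces = [np.zeros_like(W1), np.zeros_like(b1), np.zeros_like(W2), 0.0]
pseudo_x = rng.uniform(-1, 1, (N_PSEUDO, 3))
pseudo_y = np.array([value_and_grads(xp)[0] for xp in pseudo_x])

for t in range(20000):
    x = features(theta, omega)
    v, grads = value_and_grads(x)
    u = UMAX * np.sign(rng.normal())         # crude exploratory bang-bang policy
    theta, omega, r = step(theta, omega, u)
    v_next, _ = value_and_grads(features(theta, omega))

    delta = r + GAMMA * v_next - v           # TD error
    for i in range(4):                       # decay and accumulate traces
        traces[i] = GAMMA * LAM * traces[i] + grads[i]
    W1 += ALPHA * delta * traces[0]; b1 += ALPHA * delta * traces[1]
    W2 += ALPHA * delta * traces[2]; b2 += ALPHA * delta * traces[3]

    # Pseudopattern rehearsal: periodically snapshot the network's own outputs
    # at random inputs, then keep pulling the network back toward that snapshot
    # so TD updates in one region do not overwrite the mapping elsewhere.
    if t % REHEARSE_EVERY == 0:
        pseudo_x = rng.uniform(-1, 1, (N_PSEUDO, 3))
        pseudo_y = np.array([value_and_grads(xp)[0] for xp in pseudo_x])
    k = rng.integers(N_PSEUDO)
    vp, gp = value_and_grads(pseudo_x[k])
    err = pseudo_y[k] - vp
    W1 += ALPHA * err * gp[0]; b1 += ALPHA * err * gp[1]
    W2 += ALPHA * err * gp[2]; b2 += ALPHA * err * gp[3]
```

The rehearsal updates are ordinary supervised gradient steps on pseudopatterns generated by the network itself, interleaved with the TD updates; the refresh interval and number of pseudopatterns shown here are arbitrary choices made only to keep the sketch short.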