Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara, Japan.
Neural Netw. 2017 Oct;94:13-23. doi: 10.1016/j.neunet.2017.06.007. Epub 2017 Jun 29.
We propose a new value function approach for model-free reinforcement learning in Markov decision processes with high-dimensional states that addresses the issues of brittleness and intractable computational complexity, thereby rendering value function based reinforcement learning algorithms applicable to high-dimensional systems. Our new algorithm, Kernel Dynamic Policy Programming (KDPP), smoothly updates the value function in accordance with the Kullback-Leibler divergence between the current and updated policies. Stabilizing learning in this manner enables the application of the kernel trick to value function approximation, which greatly reduces the computational requirements for learning in high-dimensional state spaces. The performance of KDPP against other kernel trick based value function approaches is first investigated in a simulated n-DOF manipulator reaching task, where only KDPP efficiently learned a viable policy at n = 40. As an application to a real-world high-dimensional robot system, KDPP successfully learned to unscrew a bottle cap using a Pneumatic Artificial Muscle (PAM) driven robotic hand with tactile sensors, a system with a 32-dimensional state space, given limited samples and ordinary computing resources.
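To make the described mechanism concrete, below is a minimal NumPy sketch of a Dynamic Policy Programming style update (the KL-regularized, smoothed value update that underlies KDPP) combined with kernel ridge regression over action preferences. This is an illustrative reconstruction, not the paper's implementation: the class name KernelDPPSketch, the per-action ridge regression, and the hyperparameters eta, gamma, sigma, and lam are assumptions introduced for the example.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """Gaussian (RBF) kernel matrix between rows of X (n,d) and Y (m,d)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def softmax_value(psi, eta):
    """(1/eta) * log sum_a exp(eta * psi(s,a)), computed with a max shift
    for numerical stability; eta controls the KL regularization strength."""
    m = psi.max(axis=1)
    return m + np.log(np.exp(eta * (psi - m[:, None])).sum(axis=1)) / eta

class KernelDPPSketch:
    """Sketch of DPP-style value iteration with kernelized function
    approximation for discrete actions. Hyperparameter values here are
    placeholders, not the values used in the paper."""

    def __init__(self, n_actions, eta=1.0, gamma=0.99, sigma=1.0, lam=1e-3):
        self.n_actions, self.eta, self.gamma = n_actions, eta, gamma
        self.sigma, self.lam = sigma, lam
        self.support = [None] * n_actions  # per-action support states
        self.dual = [None] * n_actions     # per-action dual weights

    def psi(self, S):
        """Action preferences at states S; zero before the first fit."""
        out = np.zeros((len(S), self.n_actions))
        for a in range(self.n_actions):
            if self.support[a] is not None:
                out[:, a] = rbf_kernel(S, self.support[a], self.sigma) @ self.dual[a]
        return out

    def iterate(self, S, A, R, S_next):
        """One sweep over sampled transitions (s, a, r, s')."""
        psi_s, psi_next = self.psi(S), self.psi(S_next)
        v_s = softmax_value(psi_s, self.eta)
        v_next = softmax_value(psi_next, self.eta)
        # DPP target: the old preference plus the soft Bellman error, so the
        # update stays close to the current policy (KL-regularized smoothing).
        target = psi_s[np.arange(len(A)), A] - v_s + R + self.gamma * v_next
        for a in range(self.n_actions):
            idx = np.flatnonzero(A == a)
            if idx.size == 0:
                continue
            # Kernel ridge regression of the targets onto states where a was
            # taken: the kernel trick keeps the cost tied to the sample count
            # rather than to the state dimensionality.
            K = rbf_kernel(S[idx], S[idx], self.sigma)
            self.support[a] = S[idx]
            self.dual[a] = np.linalg.solve(K + self.lam * np.eye(idx.size), target[idx])
```

In use, one would call iterate repeatedly on batches of transitions and act via a Boltzmann (or greedy) policy over psi(S); because the regression works in the kernel's dual space, the state dimension enters only through pairwise kernel evaluations, which is the computational advantage the abstract attributes to the kernel trick.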