Department of Computer Science, University of York, York YO10 5DD, UK.
Neural Netw. 2010 May;23(4):541-50. doi: 10.1016/j.neunet.2010.01.001. Epub 2010 Jan 11.
Potential-based reward shaping has been shown to be a powerful method for improving the convergence rate of reinforcement learning agents. It is a flexible technique for incorporating background knowledge into temporal-difference learning in a principled way. However, the question remains of how to compute the potential function that is used to shape the reward given to the learning agent. In this paper, we show how, in the absence of knowledge to define the potential function manually, this function can be learned online, in parallel with the actual reinforcement learning process. Two cases are considered. The first solution, based on multi-grid discretisation, is designed for model-free reinforcement learning. In the second case, an approach for the prototypical model-based R-max algorithm is proposed; it learns the potential function using the free-space assumption about transitions in the environment. Two novel algorithms are presented and evaluated.
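As a rough illustration of the idea only (not the paper's exact algorithms), the sketch below combines Q-learning with potential-based shaping, F(s, s') = γΦ(s') − Φ(s), where the potential Φ is itself estimated online as a TD value function over a coarse aggregation of a toy grid world. The environment, the coarse-cell mapping, and all hyper-parameters are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Minimal sketch: Q-learning with potential-based reward shaping, where the
# potential Phi is estimated online as a value function over a coarse
# ("multigrid") aggregation of states. Grid size, aggregation and all
# hyper-parameters are illustrative assumptions, not the paper's setup.

GRID = 16          # fine grid is GRID x GRID, goal at the far corner
COARSE = 4         # coarse cells of size COARSE x COARSE define Phi's domain
GAMMA = 0.99
ALPHA = 0.1        # learning rate for the fine-level Q-function
ALPHA_PHI = 0.1    # learning rate for the coarse-level potential
EPSILON = 0.1
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def step(state, action):
    """One move on the fine grid; reward 1 only when the goal is reached."""
    x, y = state
    dx, dy = action
    nx = min(max(x + dx, 0), GRID - 1)
    ny = min(max(y + dy, 0), GRID - 1)
    next_state = (nx, ny)
    done = next_state == (GRID - 1, GRID - 1)
    return next_state, (1.0 if done else 0.0), done

def coarse(state):
    """Map a fine state to its coarse cell (the multigrid abstraction)."""
    return (state[0] // COARSE, state[1] // COARSE)

Q = defaultdict(float)      # fine-level action values
phi = defaultdict(float)    # coarse-level potential, learned online

def greedy(state):
    return max(ACTIONS, key=lambda a: Q[(state, a)])

for episode in range(500):
    state, done = (0, 0), False
    while not done:
        action = random.choice(ACTIONS) if random.random() < EPSILON else greedy(state)
        next_state, reward, done = step(state, action)

        # Learn the potential online: TD(0) value update on the coarse grid.
        target = reward + (0.0 if done else GAMMA * phi[coarse(next_state)])
        phi[coarse(state)] += ALPHA_PHI * (target - phi[coarse(state)])

        # Potential-based shaping term F(s, s') = gamma * Phi(s') - Phi(s).
        shaping = GAMMA * phi[coarse(next_state)] - phi[coarse(state)]

        # Standard Q-learning update applied to the shaped reward.
        best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (
            reward + shaping + GAMMA * best_next - Q[(state, action)]
        )
        state = next_state

print("Learned potential over coarse cells:", dict(phi))
```

Because the shaping term is a difference of potentials, the optimal policy of the shaped problem is unchanged; the learned Φ only redistributes reward to speed up learning.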