Szepesvári C, Littman ML
Mindmaker, Ltd., Budapest 1121, Konkoly Thege M. U. 29-33, Hungary.
Neural Comput. 1999 Nov 15;11(8):2017-59. doi: 10.1162/089976699300016070.
Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity to interact with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.
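As a concrete illustration of the value-function-based algorithms the theorem covers, the sketch below shows standard tabular Q-learning with asynchronous updates (one state-action pair revised per step). The environment interface (reset/step/actions), the epsilon-greedy behavior policy, and the learning-rate schedule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of asynchronous tabular Q-learning.
# The environment interface and parameter names are assumptions for illustration.
import random
from collections import defaultdict

def q_learning(env, num_episodes=500, gamma=0.95, epsilon=0.1):
    """Asynchronous Q-learning: update one (state, action) estimate per step."""
    Q = defaultdict(float)        # Q[(state, action)] -> current value estimate
    visits = defaultdict(int)     # per-pair visit counts for the step-size schedule

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy over the current value estimates
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)

            # decaying step size alpha = 1 / (number of visits to this pair)
            visits[(state, action)] += 1
            alpha = 1.0 / visits[(state, action)]

            # one-step lookahead target: r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            target = reward + gamma * best_next

            # asynchronous update: only the visited (state, action) pair changes
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```

The paper's theorem relates the convergence of an asynchronous scheme like this one, which revises a single estimate per interaction, to that of a simpler synchronous counterpart that updates every state-action pair at each step.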