Tokyo Institute of Technology, 2-12-1-W8-74, O-okayama, Tokyo, 152-8552, Japan.
Neural Netw. 2012 Feb;26:118-29. doi: 10.1016/j.neunet.2011.09.005. Epub 2011 Oct 1.
Policy gradient is a useful model-free reinforcement learning approach, but it tends to suffer from instability of gradient estimates. In this paper, we analyze and improve the stability of policy gradient methods. We first prove that the variance of gradient estimates in the PGPE (policy gradients with parameter-based exploration) method is smaller than that of the classical REINFORCE method under a mild assumption. We then derive the optimal baseline for PGPE, which contributes to further reducing the variance. We also theoretically show that PGPE with the optimal baseline is preferable to REINFORCE with the optimal baseline in terms of the variance of gradient estimates. Finally, we demonstrate the usefulness of the improved PGPE method through experiments.
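To make the idea concrete, below is a minimal sketch of PGPE-style gradient estimation with a variance-reducing baseline, assuming a Gaussian hyper-distribution over policy parameters with mean eta and standard deviation tau. The function names (pgpe_gradient, rollout_return) and the specific baseline form are illustrative assumptions, not the authors' reference implementation.

```python
# A minimal sketch of PGPE gradient estimation with a constant baseline.
# Assumes a factorized Gaussian over policy parameters; the environment
# interface (rollout_return) is a hypothetical placeholder.
import numpy as np

def pgpe_gradient(eta, tau, rollout_return, n_samples=100, rng=None):
    """Estimate the gradient of expected return w.r.t. (eta, tau).

    eta, tau : np.ndarray -- mean and std of the Gaussian over policy parameters.
    rollout_return : callable -- maps a sampled parameter vector theta to its return R.
    """
    rng = np.random.default_rng() if rng is None else rng
    thetas = eta + tau * rng.standard_normal((n_samples, eta.size))  # sample parameters
    returns = np.array([rollout_return(th) for th in thetas])        # one rollout each

    # Score functions of the Gaussian hyper-distribution p(theta | eta, tau).
    g_eta = (thetas - eta) / tau**2                        # d/d eta  of log p(theta)
    g_tau = ((thetas - eta)**2 - tau**2) / tau**3          # d/d tau  of log p(theta)
    scores = np.concatenate([g_eta, g_tau], axis=1)        # per-sample score vectors

    # Variance-minimizing constant baseline: returns weighted by the squared
    # norm of the score (the general form of an optimal baseline; see the paper
    # for the exact derivation in the PGPE setting).
    sq_norm = np.sum(scores**2, axis=1)
    baseline = np.sum(returns * sq_norm) / np.sum(sq_norm)

    grad = np.mean((returns - baseline)[:, None] * scores, axis=0)
    return grad[:eta.size], grad[eta.size:]                # gradients for eta and tau
```

In contrast to REINFORCE, which injects stochasticity into every action, this scheme samples a single parameter vector per rollout and differentiates the log-density of the parameter distribution, which is the source of the variance reduction analyzed in the paper.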