Osaka University, 2-1 Yamadaoka, Suita, Osaka, Japan.
Neural Netw. 2019 Jan;109:103-112. doi: 10.1016/j.neunet.2018.10.007. Epub 2018 Oct 21.
Natural policy gradient (NPG) methods are promising approaches to finding locally optimal policy parameters. The NPG approach works well in optimizing complex policies with high-dimensional parameters, and its effectiveness has been demonstrated in many fields. However, incremental estimation of the NPG is computationally unstable owing to its high sensitivity to the step-size values, especially to the one used to update the NPG estimate. In this study, we propose a new incremental and stable algorithm for NPG estimation. We call the proposed algorithm the implicit incremental natural actor critic (I2NAC); it is based on the idea of the implicit update. A convergence analysis of I2NAC is provided. The theoretical results indicate the stability of I2NAC and the instability of conventional incremental NPG methods. Numerical experiments show that I2NAC is less sensitive to the values of its meta-parameters, including the step size for the NPG update, compared to the existing incremental NPG method.
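To convey the intuition behind the implicit update that I2NAC builds on, the following minimal Python sketch (our illustration, not the paper's algorithm; the names alpha, lam, and the quadratic objective are assumptions chosen for exposition) contrasts an explicit incremental update with an implicit one on a one-dimensional quadratic f(w) = 0.5 * lam * w**2. The implicit rule evaluates the gradient at the new iterate, which yields a contraction for any positive step size and illustrates why implicit updates are far less sensitive to the step-size value.

    # A minimal sketch (not I2NAC itself): explicit vs. implicit incremental
    # updates on f(w) = 0.5 * lam * w**2, whose gradient is f'(w) = lam * w.

    def explicit_step(w, alpha, lam):
        # Explicit rule: w_{t+1} = w_t - alpha * f'(w_t).
        # The iteration factor is (1 - alpha*lam); it diverges if alpha*lam > 2.
        return w - alpha * lam * w

    def implicit_step(w, alpha, lam):
        # Implicit rule: w_{t+1} = w_t - alpha * f'(w_{t+1}).
        # Solving w_{t+1} = w_t - alpha*lam*w_{t+1} for w_{t+1} gives a
        # contraction factor 1/(1 + alpha*lam) < 1 for any alpha > 0.
        return w / (1.0 + alpha * lam)

    w_exp = w_imp = 1.0
    alpha, lam = 3.0, 1.0   # step size deliberately too large for the explicit rule
    for t in range(10):
        w_exp = explicit_step(w_exp, alpha, lam)
        w_imp = implicit_step(w_imp, alpha, lam)
    print(f"explicit: {w_exp:.3e}")  # grows as (-2)^t: unstable
    print(f"implicit: {w_imp:.3e}")  # shrinks as (1/4)^t: stable

The same mechanism, applied to the incremental estimate of the NPG rather than to a toy quadratic, is what the abstract refers to: the implicit form keeps the estimate stable even when the step size for the NPG update is set aggressively.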