IEEE Trans Cybern. 2022 Sep;52(9):9428-9438. doi: 10.1109/TCYB.2021.3051456. Epub 2022 Aug 18.
In recent years, the proximal policy optimization (PPO) algorithm has received considerable attention because of its excellent performance in many challenging tasks. However, the mechanism of PPO's clipping operation, a key ingredient of its performance, still lacks a thorough theoretical explanation. In addition, although PPO is inspired by the learning theory of trust region policy optimization (TRPO), the theoretical connection between PPO's clipping operation and TRPO's trust region constraint has not been well studied. In this article, we first analyze the effect of PPO's clipping operation on the objective function of conservative policy iteration and rigorously establish the theoretical relationship between PPO and TRPO. We then propose a novel first-order policy gradient algorithm, authentic boundary PPO (ABPPO), which is based on an authentic boundary setting rule. To better keep the difference between the new and old policies within the clipping range, we further extend ABPPO into two improved PPO algorithms: rollback mechanism-based ABPPO (RMABPPO) and penalized point policy difference-based ABPPO (P3DABPPO), which apply rollback clipping and a penalized point policy difference, respectively. Experiments on continuous robotic control tasks in MuJoCo show that the proposed algorithms improve learning stability and accelerate learning compared with the original PPO.
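For context, the clipping operation discussed in the abstract is the one in the standard PPO clipped surrogate objective, L^CLIP(θ) = E_t[min(r_t(θ)A_t, clip(r_t(θ), 1−ε, 1+ε)A_t)], where r_t(θ) is the probability ratio between the new and old policies. The minimal numpy sketch below illustrates that objective together with a rollback-style variant in the spirit of RMABPPO; the function names, the coefficient alpha, and the exact rollback form are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate objective (per sample).

    ratio     : pi_new(a|s) / pi_old(a|s)
    advantage : estimated advantage A(s, a)
    eps       : clipping range epsilon
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the minimum keeps the pessimistic (lower) bound of the two terms.
    return np.minimum(unclipped, clipped)

def rollback_clip_objective(ratio, advantage, eps=0.2, alpha=0.3):
    """Illustrative rollback-style clipping (assumed form, not the paper's exact
    RMABPPO objective): outside [1-eps, 1+eps] the flat clipped region is
    replaced by a term with negative slope -alpha * ratio * advantage, so that
    maximizing the objective actively pushes the ratio back into the range.
    The offset keeps the objective continuous at the boundaries."""
    upper, lower = 1.0 + eps, 1.0 - eps
    surrogate = ratio * advantage
    rb_upper = -alpha * ratio * advantage + (1.0 + alpha) * upper * advantage
    rb_lower = -alpha * ratio * advantage + (1.0 + alpha) * lower * advantage
    out = np.where(ratio > upper, np.minimum(surrogate, rb_upper), surrogate)
    out = np.where(ratio < lower, np.minimum(surrogate, rb_lower), out)
    return out
```

As in standard PPO, keeping the minimum with the unclipped term preserves the pessimistic behavior when the ratio moves in the direction favored by the advantage, while the rollback term penalizes ratios that drift outside the clipping range instead of merely ignoring their gradient.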