Research Center for Electrical and Information Technology, Department of Electrical and Information Engineering, Seoul National University of Science and Technology, Seoul 01811, Korea.
Applied Robot R&D Department, Korea Institute of Industrial Technology (KITECH), Ansan 15588, Korea.
Sensors (Basel). 2022 Sep 25;22(19):7266. doi: 10.3390/s22197266.
Reinforcement learning (RL) trains an agent by maximizing the expected sum of discounted rewards. Because the discount factor strongly affects the agent's learning performance, it must be chosen carefully. When uncertainties are involved in the training, a constant discount factor can limit the achievable learning performance. To obtain acceptable learning performance consistently, this paper proposes an adaptive rule for the discount factor based on the advantage function, and shows how the advantage function can be used in both on-policy and off-policy algorithms. To demonstrate the proposed adaptive rule, it is applied to PPO (Proximal Policy Optimization) on Tetris to validate the on-policy case, and to SAC (Soft Actor-Critic) for the motion planning of a robot manipulator to validate the off-policy case. In both cases, the proposed method performs as well as or better than the best constant discount factors found by exhaustive search. Hence, the proposed adaptive rule automatically finds a discount factor that yields comparable training performance and can be applied to representative deep reinforcement learning problems.
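The abstract does not reproduce the update rule itself, so the following is a rough sketch of what an advantage-driven discount-factor schedule could look like, not the rule proposed in the paper. The function name adapt_discount, the tanh-bounded update, the step size, and the clipping bounds are all illustrative assumptions; in practice the advantages would come from the learner's own estimator (e.g., GAE in PPO).

    import numpy as np

    def adapt_discount(gamma, advantages, step_size=0.01,
                       gamma_min=0.90, gamma_max=0.999):
        """Hypothetical adaptive discount rule (illustration only).

        Nudges gamma by a bounded function of the mean advantage over
        the latest batch, then clips it to a valid range. This is an
        assumed form, not the paper's published rule.
        """
        mean_adv = float(np.mean(advantages))
        gamma += step_size * np.tanh(mean_adv)  # bounded update step
        return float(np.clip(gamma, gamma_min, gamma_max))

    # Example: update gamma once per training batch.
    gamma = 0.99
    batch_advantages = np.array([0.3, -0.1, 0.8, 0.2])  # e.g., one PPO batch
    gamma = adapt_discount(gamma, batch_advantages)

Under this kind of scheme, the schedule would be called once per policy update in either an on-policy loop (PPO) or an off-policy loop (SAC), which matches the abstract's claim that the advantage function can drive the adaptation in both settings.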