IEEE Trans Pattern Anal Mach Intell. 2023 Jun;45(6):7686-7695. doi: 10.1109/TPAMI.2022.3223407. Epub 2023 May 5.
Controlling a non-statically stable bipedal robot is challenging because of the complex dynamics and multi-criterion optimization involved. Recent work has demonstrated the effectiveness of deep reinforcement learning (DRL) for both simulated and physical robots. In these methods, the rewards from different criteria are typically summed to learn a scalar value function. However, a scalar is less informative and may be insufficient to derive effective feedback for each reward channel from the complex hybrid reward. In this work, we propose a novel reward-adaptive reinforcement learning method for biped locomotion that allows the control policy to be optimized by multiple criteria simultaneously through a dynamic mechanism. The proposed method uses a multi-head critic to learn a separate value function for each reward component, yielding hybrid policy gradients. We further introduce dynamic weights, which allow each component to optimize the policy with a different priority. This hybrid and dynamic policy gradient (HDPG) design lets the agent learn more efficiently. We show that the proposed method outperforms summed-reward approaches and transfers to physical robots. Results on MuJoCo benchmarks further demonstrate the effectiveness and generalization of HDPG.
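The core idea above — one value estimate per reward channel, combined into a single policy update through dynamic weights — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the weighting scheme below (softmax over recent per-channel TD-error magnitudes, so less-well-learned channels get higher priority) is an assumption chosen for concreteness, and `dynamic_weights`/`hybrid_policy_gradient` are hypothetical helper names.

```python
import math

def dynamic_weights(td_errors, temperature=1.0):
    """Assumed priority scheme (not the paper's exact formula): reward
    channels with larger recent TD-error magnitude receive more weight,
    via a numerically stable softmax. Returns weights summing to 1."""
    mags = [abs(e) / temperature for e in td_errors]
    m = max(mags)
    exps = [math.exp(v - m) for v in mags]
    s = sum(exps)
    return [v / s for v in exps]

def hybrid_policy_gradient(per_component_grads, weights):
    """Combine one policy gradient per reward channel into a single
    update direction: g = sum_k w_k * g_k (elementwise over parameters)."""
    n_params = len(per_component_grads[0])
    return [sum(w * g[j] for w, g in zip(weights, per_component_grads))
            for j in range(n_params)]

# Toy usage: three reward channels, a two-parameter policy.
td = [0.5, 2.0, 0.1]                 # recent TD errors, one per channel
w = dynamic_weights(td)              # channel 1 dominates (largest error)
grads = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
g = hybrid_policy_gradient(grads, w)
```

With a summed scalar reward, all channels would implicitly share one weight; here each channel's gradient contribution is rescaled every update, which is what allows the priorities to shift as some criteria are mastered before others.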