Zhang Tianren, Guo Shangqi, Tan Tian, Hu Xiaolin, Chen Feng
IEEE Trans Pattern Anal Mach Intell. 2023 Apr;45(4):4152-4166. doi: 10.1109/TPAMI.2022.3192418. Epub 2023 Mar 7.
Goal-conditioned Hierarchical Reinforcement Learning (HRL) is a promising approach for scaling up reinforcement learning (RL) techniques. However, it often suffers from training inefficiency because the action space of the high level, i.e., the goal space, is large. Searching in a large goal space poses difficulties for both high-level subgoal generation and low-level policy learning. In this article, we show that this problem can be effectively alleviated by restricting the high-level action space from the whole goal space to a k-step adjacent region of the current state using an adjacency constraint. We theoretically prove that in a deterministic Markov Decision Process (MDP) the proposed adjacency constraint preserves the optimal hierarchical policy, while in a stochastic MDP the adjacency constraint induces a bounded state-value suboptimality determined by the MDP's transition structure. We further show that this constraint can be practically implemented by training an adjacency network that can discriminate between adjacent and non-adjacent subgoals. Experimental results on discrete and continuous control tasks, including challenging simulated robot locomotion and manipulation tasks, show that incorporating the adjacency constraint significantly boosts the performance of state-of-the-art goal-conditioned HRL approaches.
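To make the mechanism concrete, below is a minimal sketch of how an adjacency network and the resulting constraint could look in practice. It assumes subgoals live in (a subspace of) the state space and uses a contrastive-style loss to separate k-step-adjacent from non-adjacent state pairs; all names (AdjacencyNet, adjacency_loss, adjacency_penalty, eps_k, margin) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacencyNet(nn.Module):
    """Embeds states so that Euclidean distance in the embedding space
    approximates whether two states are within k environment steps."""
    def __init__(self, state_dim, embed_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, s):
        return self.net(s)

def adjacency_loss(net, s, s_adj, s_far, eps_k=1.0, margin=0.5):
    """Contrastive-style objective (an assumption, not the paper's exact loss):
    pull pairs observed within k steps of each other inside radius eps_k,
    push pairs farther than k steps beyond eps_k + margin."""
    d_pos = (net(s) - net(s_adj)).norm(dim=-1)
    d_neg = (net(s) - net(s_far)).norm(dim=-1)
    return F.relu(d_pos - eps_k).mean() + F.relu(eps_k + margin - d_neg).mean()

def adjacency_penalty(net, state, subgoal, eps_k=1.0):
    """Hypothetical penalty added to the high-level policy loss when a proposed
    subgoal falls outside the approximate k-step adjacent region of the state."""
    d = (net(state) - net(subgoal)).norm(dim=-1)
    return F.relu(d - eps_k).mean()
```

In such a setup, adjacent/non-adjacent state pairs would be sampled from collected trajectories, and adjacency_penalty would be weighted into the high-level policy objective so that generated subgoals stay within the k-step region.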