Huang Zhenbo, Zhao Jing, Sun Shiliang
IEEE Trans Neural Netw Learn Syst. 2024 Aug 23;PP. doi: 10.1109/TNNLS.2024.3443082.
Offline reinforcement learning (RL) has been widely used in practice due to its efficient data utilization, but it still faces the challenge of training vulnerability caused by policy deviation. Existing offline RL methods that add policy constraints or perform conservative Q-value estimation are pessimistic, making the learned policy suboptimal. In this article, we address the pessimism problem by focusing on accurate Q-value estimation. We propose the de-pessimism (DEP) operator, which estimates Q values with either the optimal Bellman operator or a compensation operator, depending on whether the action lies in the behavior support set. The compensation operator qualitatively determines whether an out-of-distribution (OOD) action is positive or negative based on its performance relative to the behavior actions. It leverages differences in state values to compensate the Q values of positive OOD actions, thereby alleviating pessimism. We theoretically establish the convergence of DEP and its effectiveness in policy improvement. To further advance practical application, we integrate DEP into the soft actor-critic (SAC) algorithm, yielding value-compensated de-pessimism offline RL (DoRL-VC). Experimentally, DoRL-VC achieves state-of-the-art (SOTA) performance across MuJoCo locomotion, Maze2D, and challenging Adroit tasks, demonstrating the efficacy of DEP in mitigating pessimism.
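The sketch below illustrates the flavor of such a de-pessimized target, assuming the concrete formulas are not given here: it switches between a standard Bellman target for in-support actions and a state-value-difference compensation for OOD actions judged positive. The support test, the "positive OOD" criterion, and the compensation term are illustrative assumptions, not the authors' exact DEP operator.

```python
# Hypothetical sketch of a DEP-style de-pessimized Q-value target.
# All thresholds and the compensation rule are illustrative assumptions.
def dep_target(q_next, v_next, v_curr, reward, gamma,
               behavior_density, support_threshold=0.05):
    """Compute a target for one (s, a, r, s') sample.

    q_next           : bootstrapped Q estimate at the next state-action pair
    v_next, v_curr   : state-value estimates V(s') and V(s)
    reward, gamma    : transition reward and discount factor
    behavior_density : estimated behavior-policy density of the chosen action
    support_threshold: density cutoff used here as a stand-in support test
    """
    if behavior_density >= support_threshold:
        # In-support action: ordinary (optimal) Bellman target.
        return reward + gamma * q_next

    # OOD action: treat it as "positive" if its bootstrapped value beats the
    # current state value, and compensate with the state-value difference.
    if q_next > v_curr:
        compensation = v_next - v_curr  # illustrative compensation term
        return reward + gamma * (q_next + compensation)

    # Negative OOD action: fall back to a conservative (pessimistic) target.
    return reward + gamma * min(q_next, v_next)


if __name__ == "__main__":
    # Toy usage with made-up numbers for an OOD but positive action.
    tgt = dep_target(q_next=1.2, v_next=1.0, v_curr=0.8,
                     reward=0.1, gamma=0.99, behavior_density=0.01)
    print(f"DEP-style target: {tgt:.3f}")
```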