Li Fan, Fu Mingsheng, Chen Wenyu, Zhang Fan, Zhang Haixian, Qu Hong, Yi Zhang
IEEE Trans Neural Netw Learn Syst. 2024 Jul;35(7):8783-8796. doi: 10.1109/TNNLS.2022.3215596. Epub 2024 Jul 8.
Deep off-policy actor-critic algorithms have been successfully applied to challenging tasks in continuous control. However, these methods typically suffer from poor sample efficiency, which limits their adoption in real-world domains. To mitigate this issue, we propose a novel actor-critic algorithm with weakly pessimistic value estimation and optimistic policy optimization (WPVOP) for continuous control. WPVOP integrates two key ingredients: 1) a weakly pessimistic value estimation, which compensates for the pessimism of the lower confidence bound used by the conventional value function (i.e., clipped double Q-learning) to trigger exploration in low-value state-action regions, and 2) an optimistic policy optimization algorithm that samples the actions most beneficial to driving policy learning toward optimal Q-values, enabling efficient exploration. We theoretically show that the proposed weakly pessimistic value estimate is bounded from below and above, and empirically show that it avoids extremely over-optimistic value estimates. We show that these two ideas are largely complementary and can be fruitfully integrated to improve performance and promote the sample efficiency of exploration. We evaluate WPVOP on the suite of continuous control tasks from MuJoCo, achieving state-of-the-art sample efficiency and performance.
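The abstract describes both ingredients only at a high level and gives no equations. The following is a minimal PyTorch sketch of one plausible reading, not the paper's actual implementation: the weakly pessimistic target mixes the lower confidence bound min(Q1, Q2) from clipped double Q-learning with the upper bound max(Q1, Q2) via a mixing coefficient, which keeps the estimate bounded between the two; the optimistic action selector scores several perturbed candidate actions with the critic and keeps the highest-valued one. The names `weakly_pessimistic_target`, `optimistic_action`, `beta`, `n_candidates`, and `noise_std` are all illustrative assumptions.

```python
# Hedged sketch of the two WPVOP ingredients named in the abstract.
# All parameterizations here are assumptions; the paper's exact forms may differ.
import torch

def weakly_pessimistic_target(q1_targ, q2_targ, beta=0.75):
    """Soften the pessimism of clipped double Q-learning (assumed form).

    Plain clipped double Q-learning uses min(Q1, Q2), a lower confidence
    bound. Mixing that lower bound with the upper bound max(Q1, Q2) yields
    an estimate that always lies between the two, hence bounded from below
    and above, as the abstract claims for the weakly pessimistic estimate.
    `beta` is a hypothetical knob controlling how much pessimism remains.
    """
    q_min = torch.min(q1_targ, q2_targ)
    q_max = torch.max(q1_targ, q2_targ)
    return beta * q_min + (1.0 - beta) * q_max

def optimistic_action(actor, critic, state, n_candidates=10, noise_std=0.2):
    """Sample perturbed candidate actions and keep the highest-valued one.

    One simple reading of "optimistic policy optimization": perturb the
    actor's action, score each candidate with the critic, and act with the
    candidate the critic values most, steering exploration toward actions
    that look most beneficial under the current Q-function.
    """
    with torch.no_grad():
        mean_action = actor(state)                        # (batch, act_dim)
        noise = noise_std * torch.randn(n_candidates, *mean_action.shape)
        candidates = (mean_action.unsqueeze(0) + noise).clamp(-1.0, 1.0)
        # Score every candidate with the critic, then keep the best per state.
        flat = candidates.reshape(-1, mean_action.shape[-1])
        rep_state = state.repeat(n_candidates, 1)
        q_vals = critic(rep_state, flat).reshape(n_candidates, -1)
        best = q_vals.argmax(dim=0)                       # (batch,)
        batch_idx = torch.arange(mean_action.shape[0])
        return candidates[best, batch_idx]                # (batch, act_dim)
```

Under this assumed parameterization, `beta = 1` recovers standard clipped double Q-learning, while smaller values weaken the pessimism of the lower confidence bound, which is how the sketch models the compensation the abstract describes.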