Cheng Yuhu, Huang Longyang, Chen C L Philip, Wang Xuesong
IEEE Trans Neural Netw Learn Syst. 2023 Nov;34(11):9054-9063. doi: 10.1109/TNNLS.2022.3155483. Epub 2023 Oct 27.
The accurate estimation of the Q-function and the enhancement of an agent's exploration ability have long been challenges for off-policy actor-critic algorithms. To address both concerns, a novel robust actor-critic (RAC) is developed in this article. We first derive a robust policy improvement mechanism (RPIM) that uses the locally optimal policy with respect to the currently estimated Q-function to guide policy improvement. By constraining the relative entropy between the new policy and the previous one during policy improvement, the proposed RPIM enhances the stability of the policy update process. Theoretical analysis shows that the policy update carries an incentive to increase the policy entropy, which is conducive to enhancing the exploration ability of agents. RAC is then developed by applying the proposed RPIM to regulate the actor improvement process, and the developed RAC is proven to be convergent. Finally, the proposed RAC is evaluated on several continuous-action control tasks on the MuJoCo platform, and the experimental results show that RAC outperforms several state-of-the-art reinforcement learning algorithms.
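To illustrate the flavor of a relative-entropy-constrained policy improvement step as described in the abstract, the sketch below shows one way such an update could look for a Gaussian policy. This is not the authors' implementation: the names actor, old_actor, q_fn, and the coefficients kl_coef and ent_coef are hypothetical, and the penalty-based form of the constraint is an assumption made for the example.

```python
# Minimal sketch (assumed form, not the paper's RAC/RPIM implementation):
# one policy improvement step that maximizes the estimated Q-value while
# penalizing KL divergence from the previous policy and rewarding entropy.
import torch
from torch.distributions import Normal, kl_divergence

def policy_improvement_step(actor, old_actor, q_fn, states, optimizer,
                            kl_coef=0.1, ent_coef=0.01):
    mean, std = actor(states)                  # new policy parameters
    dist = Normal(mean, std)
    actions = dist.rsample()                   # reparameterized sample

    with torch.no_grad():
        old_mean, old_std = old_actor(states)  # frozen previous policy
    old_dist = Normal(old_mean, old_std)

    q_values = q_fn(states, actions)                   # critic estimate of Q(s, a)
    kl = kl_divergence(dist, old_dist).sum(-1)         # relative entropy to old policy
    entropy = dist.entropy().sum(-1)                   # exploration incentive

    # Maximize Q - kl_coef * KL + ent_coef * entropy by minimizing its negative.
    loss = (-q_values.squeeze(-1) + kl_coef * kl - ent_coef * entropy).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this toy form, the KL penalty keeps each new policy close to its predecessor (stabilizing the update), while the entropy term mirrors the abstract's observation that the update should encourage exploration; the paper's actual mechanism and coefficients may differ.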