
Robust Actor-Critic With Relative Entropy Regulating Actor.

Authors

Cheng Yuhu, Huang Longyang, Chen C L Philip, Wang Xuesong

Publication Information

IEEE Trans Neural Netw Learn Syst. 2023 Nov;34(11):9054-9063. doi: 10.1109/TNNLS.2022.3155483. Epub 2023 Oct 27.

Abstract

Accurately estimating the Q-function and enhancing the agent's exploration ability have long been challenges for off-policy actor-critic algorithms. To address both concerns, a novel robust actor-critic (RAC) algorithm is developed in this article. We first derive a robust policy improvement mechanism (RPIM) that uses the locally optimal policy with respect to the current estimated Q-function to guide policy improvement. By constraining the relative entropy between the new policy and the previous one during policy improvement, the proposed RPIM enhances the stability of the policy update process. Theoretical analysis shows that each policy update carries an incentive to increase the policy entropy, which helps enhance the agent's exploration ability. RAC is then developed by applying the proposed RPIM to regulate the actor improvement process, and the resulting algorithm is proven to be convergent. Finally, the proposed RAC is evaluated on continuous-action control tasks on the MuJoCo platform; the experimental results show that RAC outperforms several state-of-the-art reinforcement learning algorithms.
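The entropy claim follows from a standard identity: KL(pi || pi_old) = -H(pi) - E_pi[log pi_old], so penalizing this relative entropy implicitly rewards policies with higher entropy H(pi), which is the exploration incentive the abstract describes. Below is a minimal sketch of such a KL-regularized actor update, assuming a PyTorch diagonal-Gaussian actor and a critic q_net that maps state-action pairs to Q-values. The names GaussianActor, actor_loss, and kl_coef are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    # Diagonal-Gaussian policy over continuous actions (an assumed
    # architecture, not the one from the paper).
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def dist(self, obs):
        h = self.body(obs)
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        return torch.distributions.Normal(self.mu(h), std)

def actor_loss(actor, old_actor, q_net, obs, kl_coef=0.1):
    # Maximize Q under the new policy while penalizing KL(new || old),
    # i.e., a relative-entropy trust region on the policy update.
    dist = actor.dist(obs)
    with torch.no_grad():
        old_dist = old_actor.dist(obs)   # previous policy, held fixed
    act = dist.rsample()                 # reparameterized sample, so
                                         # gradients flow through the action
    q = q_net(torch.cat([obs, act], dim=-1))
    # Closed-form KL between diagonal Gaussians, summed over action dims.
    kl = torch.distributions.kl_divergence(dist, old_dist).sum(-1)
    return (-q.squeeze(-1) + kl_coef * kl).mean()

In practice, old_actor would be a frozen copy of the current actor refreshed after each update, and the loss minimized with a standard optimizer. The point of the sketch is only the trade-off: a small kl_coef approaches a greedy Q-maximizing update, while a larger one slows the policy change and, through the -H(pi) term inside the KL, rewards exploration.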

