A3C-GS: Adaptive Moment Gradient Sharing With Locks for Asynchronous Actor-Critic Agents.

Publication Info

IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):1162-1176. doi: 10.1109/TNNLS.2020.2980743. Epub 2021 Mar 1.

DOI: 10.1109/TNNLS.2020.2980743
PMID: 32287019
Abstract

We propose an asynchronous gradient sharing mechanism for the parallel actor-critic algorithms with improved exploration characteristics. The proposed algorithm (A3C-GS) has the property of automatically diversifying worker policies in the short term for exploration, thereby reducing the need for entropy loss terms. Despite policy diversification, the algorithm converges to the optimal policy in the long term. We show in our analysis that the gradient sharing operation is a composition of two contractions. The first contraction performs gradient computation, while the second contraction is a gradient sharing operation coordinated by locks. From these two contractions, certain short- and long-term properties result. For the short term, gradient sharing induces temporary heterogeneity in policies for performing needed exploration. In the long term, under a suitably small learning rate and gradient clipping, convergence to the optimal policy is theoretically guaranteed. We verify our results with several high-dimensional experiments and compare A3C-GS against other on-policy policy-gradient algorithms. Our proposed algorithm achieved the highest weighted score. Despite lower entropy weights, it performed well in high-dimensional environments that require exploration due to sparse rewards and those that need navigation in 3-D environments for long survival tasks. It consistently performed better than the base asynchronous advantage actor-critic (A3C) algorithm.
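The core mechanism the abstract describes — workers computing gradients locally, then sharing them into global adaptive-moment (Adam-style) estimates under a lock, with gradient clipping — can be sketched as a minimal toy. This is an illustrative simplification, not the paper's implementation: the quadratic objective, worker count, noise level, and hyperparameters are all assumptions.

```python
import threading
import numpy as np

class SharedAdamParams:
    """Shared parameters plus Adam-style moment estimates, updated under a lock."""

    def __init__(self, dim, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
        self.theta = np.zeros(dim)   # shared parameters
        self.m = np.zeros(dim)       # shared first-moment estimate
        self.v = np.zeros(dim)       # shared second-moment estimate
        self.t = 0
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.lock = threading.Lock() # coordinates the sharing step

    def share_gradient(self, grad, clip=1.0):
        # Clip the gradient, then apply a lock-protected Adam update
        # to the shared moments and parameters.
        norm = np.linalg.norm(grad)
        if norm > clip:
            grad = grad * (clip / norm)
        with self.lock:
            self.t += 1
            self.m = self.beta1 * self.m + (1 - self.beta1) * grad
            self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
            m_hat = self.m / (1 - self.beta1 ** self.t)
            v_hat = self.v / (1 - self.beta2 ** self.t)
            self.theta -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def worker(shared, target, steps=200):
    rng = np.random.default_rng()
    for _ in range(steps):
        # Each worker snapshots the shared parameters without the lock, so
        # its view may be slightly stale: this is a crude stand-in for the
        # short-term policy heterogeneity the abstract attributes to
        # asynchronous sharing.
        local = shared.theta.copy()
        # Noisy gradient of a toy quadratic loss ||local - target||^2.
        grad = 2.0 * (local - target) + rng.normal(0.0, 0.1, local.shape)
        shared.share_gradient(grad)

target = np.array([1.0, -2.0, 0.5])
shared = SharedAdamParams(dim=3)
threads = [threading.Thread(target=worker, args=(shared, target)) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(np.round(shared.theta, 2))  # should approach target
```

In this toy, the lock serializes only the sharing step (the second contraction in the paper's analysis), while gradient computation (the first contraction) runs concurrently; clipping and a small learning rate mirror the conditions under which long-term convergence is argued.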


Similar Articles

1. A3C-GS: Adaptive Moment Gradient Sharing With Locks for Asynchronous Actor-Critic Agents.
   IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):1162-1176. doi: 10.1109/TNNLS.2020.2980743. Epub 2021 Mar 1.
2. Navigation in Unknown Dynamic Environments Based on Deep Reinforcement Learning.
   Sensors (Basel). 2019 Sep 5;19(18):3837. doi: 10.3390/s19183837.
3. Actor-Critic Learning Control With Regularization and Feature Selection in Policy Gradient Estimation.
   IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):1217-1227. doi: 10.1109/TNNLS.2020.2981377. Epub 2021 Mar 1.
4. Stochastic Integrated Actor-Critic for Deep Reinforcement Learning.
   IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6654-6666. doi: 10.1109/TNNLS.2022.3212273. Epub 2024 May 2.
5. Asynchronous learning for actor-critic neural networks and synchronous triggering for multiplayer system.
   ISA Trans. 2022 Oct;129(Pt B):295-308. doi: 10.1016/j.isatra.2022.02.007. Epub 2022 Feb 10.
6. Optimal Policy of Multiplayer Poker via Actor-Critic Reinforcement Learning.
   Entropy (Basel). 2022 May 30;24(6):774. doi: 10.3390/e24060774.
7. Actor-Critic Learning Control Based on -Regularized Temporal-Difference Prediction With Gradient Correction.
   IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5899-5909. doi: 10.1109/TNNLS.2018.2808203. Epub 2018 Apr 5.
8. Boosting On-Policy Actor-Critic With Shallow Updates in Critic.
   IEEE Trans Neural Netw Learn Syst. 2025 Mar;36(3):5644-5653. doi: 10.1109/TNNLS.2024.3378913. Epub 2025 Feb 28.
9. Meta attention for Off-Policy Actor-Critic.
   Neural Netw. 2023 Jun;163:86-96. doi: 10.1016/j.neunet.2023.03.024. Epub 2023 Mar 28.
10. Implicit incremental natural actor critic algorithm.
    Neural Netw. 2019 Jan;109:103-112. doi: 10.1016/j.neunet.2018.10.007. Epub 2018 Oct 21.

Cited By

1. Category learning in a recurrent neural network with reinforcement learning.
   Front Psychiatry. 2022 Oct 25;13:1008011. doi: 10.3389/fpsyt.2022.1008011. eCollection 2022.