A3C-GS: Adaptive Moment Gradient Sharing With Locks for Asynchronous Actor-Critic Agents.

Publication Info

IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):1162-1176. doi: 10.1109/TNNLS.2020.2980743. Epub 2021 Mar 1.

DOI: 10.1109/TNNLS.2020.2980743
PMID: 32287019
Abstract

We propose an asynchronous gradient sharing mechanism for the parallel actor-critic algorithms with improved exploration characteristics. The proposed algorithm (A3C-GS) has the property of automatically diversifying worker policies in the short term for exploration, thereby reducing the need for entropy loss terms. Despite policy diversification, the algorithm converges to the optimal policy in the long term. We show in our analysis that the gradient sharing operation is a composition of two contractions. The first contraction performs gradient computation, while the second contraction is a gradient sharing operation coordinated by locks. From these two contractions, certain short- and long-term properties result. For the short term, gradient sharing induces temporary heterogeneity in policies for performing needed exploration. In the long term, under a suitably small learning rate and gradient clipping, convergence to the optimal policy is theoretically guaranteed. We verify our results with several high-dimensional experiments and compare A3C-GS against other on-policy policy-gradient algorithms. Our proposed algorithm achieved the highest weighted score. Despite lower entropy weights, it performed well in high-dimensional environments that require exploration due to sparse rewards and those that need navigation in 3-D environments for long survival tasks. It consistently performed better than the base asynchronous advantage actor-critic (A3C) algorithm.
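The core mechanism the abstract describes — workers computing gradients locally, then sharing them into global adaptive-moment (Adam-style) estimates under a lock, with gradient clipping — can be sketched as a minimal toy. This is an illustrative simplification, not the paper's implementation: the quadratic objective, worker count, noise level, and hyperparameters are all assumptions.

```python
import threading
import numpy as np

class SharedAdamParams:
    """Shared parameters plus Adam-style moment estimates, updated under a lock."""

    def __init__(self, dim, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
        self.theta = np.zeros(dim)   # shared parameters
        self.m = np.zeros(dim)       # shared first-moment estimate
        self.v = np.zeros(dim)       # shared second-moment estimate
        self.t = 0
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.lock = threading.Lock() # coordinates the sharing step

    def share_gradient(self, grad, clip=1.0):
        # Clip the gradient, then apply a lock-protected Adam update
        # to the shared moments and parameters.
        norm = np.linalg.norm(grad)
        if norm > clip:
            grad = grad * (clip / norm)
        with self.lock:
            self.t += 1
            self.m = self.beta1 * self.m + (1 - self.beta1) * grad
            self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
            m_hat = self.m / (1 - self.beta1 ** self.t)
            v_hat = self.v / (1 - self.beta2 ** self.t)
            self.theta -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def worker(shared, target, steps=200):
    rng = np.random.default_rng()
    for _ in range(steps):
        # Each worker snapshots the shared parameters without the lock, so
        # its view may be slightly stale: this is a crude stand-in for the
        # short-term policy heterogeneity the abstract attributes to
        # asynchronous sharing.
        local = shared.theta.copy()
        # Noisy gradient of a toy quadratic loss ||local - target||^2.
        grad = 2.0 * (local - target) + rng.normal(0.0, 0.1, local.shape)
        shared.share_gradient(grad)

target = np.array([1.0, -2.0, 0.5])
shared = SharedAdamParams(dim=3)
threads = [threading.Thread(target=worker, args=(shared, target)) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(np.round(shared.theta, 2))  # should approach target
```

In this toy, the lock serializes only the sharing step (the second contraction in the paper's analysis), while gradient computation (the first contraction) runs concurrently; clipping and a small learning rate mirror the conditions under which long-term convergence is argued.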


Similar Articles

1. A3C-GS: Adaptive Moment Gradient Sharing With Locks for Asynchronous Actor-Critic Agents.
   IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):1162-1176. doi: 10.1109/TNNLS.2020.2980743. Epub 2021 Mar 1.
2. Navigation in Unknown Dynamic Environments Based on Deep Reinforcement Learning.
   Sensors (Basel). 2019 Sep 5;19(18):3837. doi: 10.3390/s19183837.
3. Actor-Critic Learning Control With Regularization and Feature Selection in Policy Gradient Estimation.
   IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):1217-1227. doi: 10.1109/TNNLS.2020.2981377. Epub 2021 Mar 1.
4. Stochastic Integrated Actor-Critic for Deep Reinforcement Learning.
   IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6654-6666. doi: 10.1109/TNNLS.2022.3212273. Epub 2024 May 2.
5. Asynchronous learning for actor-critic neural networks and synchronous triggering for multiplayer system.
   ISA Trans. 2022 Oct;129(Pt B):295-308. doi: 10.1016/j.isatra.2022.02.007. Epub 2022 Feb 10.
6. Optimal Policy of Multiplayer Poker via Actor-Critic Reinforcement Learning.
   Entropy (Basel). 2022 May 30;24(6):774. doi: 10.3390/e24060774.
7. Actor-Critic Learning Control Based on -Regularized Temporal-Difference Prediction With Gradient Correction.
   IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5899-5909. doi: 10.1109/TNNLS.2018.2808203. Epub 2018 Apr 5.
8. Boosting On-Policy Actor-Critic With Shallow Updates in Critic.
   IEEE Trans Neural Netw Learn Syst. 2025 Mar;36(3):5644-5653. doi: 10.1109/TNNLS.2024.3378913. Epub 2025 Feb 28.
9. Meta attention for Off-Policy Actor-Critic.
   Neural Netw. 2023 Jun;163:86-96. doi: 10.1016/j.neunet.2023.03.024. Epub 2023 Mar 28.
10. Implicit incremental natural actor critic algorithm.
    Neural Netw. 2019 Jan;109:103-112. doi: 10.1016/j.neunet.2018.10.007. Epub 2018 Oct 21.

Cited By

1. Category learning in a recurrent neural network with reinforcement learning.
   Front Psychiatry. 2022 Oct 25;13:1008011. doi: 10.3389/fpsyt.2022.1008011. eCollection 2022.