
Improving Exploration in Actor-Critic With Weakly Pessimistic Value Estimation and Optimistic Policy Optimization.

Authors

Li Fan, Fu Mingsheng, Chen Wenyu, Zhang Fan, Zhang Haixian, Qu Hong, Yi Zhang

Publication

IEEE Trans Neural Netw Learn Syst. 2024 Jul;35(7):8783-8796. doi: 10.1109/TNNLS.2022.3215596. Epub 2024 Jul 8.

DOI: 10.1109/TNNLS.2022.3215596
PMID: 36306289
Abstract

Deep off-policy actor-critic algorithms have been successfully applied to challenging tasks in continuous control. However, these methods typically suffer from poor sample efficiency, which limits their widespread adoption in real-world domains. To mitigate this issue, we propose a novel actor-critic algorithm with weakly pessimistic value estimation and optimistic policy optimization (WPVOP) for continuous control. WPVOP integrates two key ingredients: 1) a weakly pessimistic value estimate, which compensates for the pessimism of the lower confidence bound in the conventional value function (i.e., clipped double Q-learning) to trigger exploration in low-value state-action regions, and 2) an optimistic policy optimization scheme that samples the actions expected to benefit policy learning most toward optimal Q-values, for efficient exploration. We theoretically show that the proposed weakly pessimistic value estimate is bounded from below and above, and empirically show that it can avoid extremely over-optimistic value estimates. We show that these two ideas are largely complementary and can be fruitfully integrated to improve performance and the sample efficiency of exploration. We evaluate WPVOP on the suite of continuous control tasks from MuJoCo, achieving state-of-the-art sample efficiency and performance.
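The abstract describes two coupled mechanisms: a value target that softens the pessimism of clipped double Q-learning's lower-confidence-bound estimate, and an optimistic action-sampling step that steers policy updates toward high-value actions. The sketch below illustrates one plausible reading of each idea in PyTorch; the convex-mix form, the coefficient beta, the candidate-sampling scheme, and all names (weakly_pessimistic_target, optimistic_action, q_net) are illustrative assumptions, not the paper's exact formulation.

import torch

def weakly_pessimistic_target(q1: torch.Tensor, q2: torch.Tensor,
                              beta: float = 0.75) -> torch.Tensor:
    # Clipped double Q-learning bootstraps from min(q1, q2), a pessimistic,
    # lower-confidence-bound-style target. Mixing in max(q1, q2) weakens
    # that pessimism so low-value state-action regions still get explored.
    # beta = 1.0 recovers the standard pessimistic target; the convex mix
    # and the default beta are assumptions for illustration only.
    q_min = torch.min(q1, q2)
    q_max = torch.max(q1, q2)
    return beta * q_min + (1.0 - beta) * q_max

def optimistic_action(policy, q_net, state: torch.Tensor,
                      n_candidates: int = 10) -> torch.Tensor:
    # Optimistic action selection: draw several candidates from the current
    # stochastic policy and keep the one the critic rates highest, so the
    # policy update is pulled toward actions with (near-)optimal Q-values.
    # Assumes policy(state) returns a torch.distributions object and
    # q_net(state, action) returns a scalar value per candidate.
    candidates = [policy(state).sample() for _ in range(n_candidates)]
    values = torch.stack([q_net(state, a) for a in candidates])
    return candidates[int(torch.argmax(values))]

Read together, the two pieces match the abstract's claim of complementarity: the softened target stays bounded between the pessimistic min and optimistic max of the twin critics, while the optimistic sampler spends exploration on the actions the current estimate rates best.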


Similar Articles

1. Improving Exploration in Actor-Critic With Weakly Pessimistic Value Estimation and Optimistic Policy Optimization.
IEEE Trans Neural Netw Learn Syst. 2024 Jul;35(7):8783-8796. doi: 10.1109/TNNLS.2022.3215596. Epub 2024 Jul 8.
2. Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors.
IEEE Trans Neural Netw Learn Syst. 2022 Nov;33(11):6584-6598. doi: 10.1109/TNNLS.2021.3082568. Epub 2022 Oct 27.
3. De-Pessimism Offline Reinforcement Learning via Value Compensation.
IEEE Trans Neural Netw Learn Syst. 2024 Aug 23;PP. doi: 10.1109/TNNLS.2024.3443082.
4. Robust Actor-Critic With Relative Entropy Regulating Actor.
IEEE Trans Neural Netw Learn Syst. 2023 Nov;34(11):9054-9063. doi: 10.1109/TNNLS.2022.3155483. Epub 2023 Oct 27.
5. Stochastic Integrated Actor-Critic for Deep Reinforcement Learning.
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6654-6666. doi: 10.1109/TNNLS.2022.3212273. Epub 2024 May 2.
6. Mild Policy Evaluation for Offline Actor-Critic.
IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17950-17964. doi: 10.1109/TNNLS.2023.3309906. Epub 2024 Dec 2.
7. Realistic Actor-Critic: A framework for balance between value overestimation and underestimation.
Front Neurorobot. 2023 Jan 9;16:1081242. doi: 10.3389/fnbot.2022.1081242. eCollection 2022.
8. Offline Reinforcement Learning With Behavior Value Regularization.
IEEE Trans Cybern. 2024 Jun;54(6):3692-3704. doi: 10.1109/TCYB.2024.3385910. Epub 2024 May 30.
9. Relative Entropy Regularized Sample-Efficient Reinforcement Learning With Continuous Actions.
IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):475-485. doi: 10.1109/TNNLS.2023.3329513. Epub 2025 Jan 7.
10. Reducing Estimation Bias via Triplet-Average Deep Deterministic Policy Gradient.
IEEE Trans Neural Netw Learn Syst. 2020 Nov;31(11):4933-4945. doi: 10.1109/TNNLS.2019.2959129. Epub 2020 Oct 30.

Cited By

1. Trajectory Tracking Control for Robotic Manipulator Based on Soft Actor-Critic and Generative Adversarial Imitation Learning.
Biomimetics (Basel). 2024 Dec 21;9(12):779. doi: 10.3390/biomimetics9120779.
2. Medical prediction from missing data with max-minus negative regularized dropout.
Front Neurosci. 2023 Jul 13;17:1221970. doi: 10.3389/fnins.2023.1221970. eCollection 2023.