Suppr 超能文献


Distributional Soft Actor-Critic With Three Refinements.

Authors

Duan Jingliang, Wang Wenxuan, Xiao Liming, Gao Jiaxin, Li Shengbo Eben, Liu Chang, Zhang Ya-Qin, Cheng Bo, Li Keqiang

Publication

IEEE Trans Pattern Anal Mach Intell. 2025 May;47(5):3935-3946. doi: 10.1109/TPAMI.2025.3537087. Epub 2025 Apr 8.

DOI: 10.1109/TPAMI.2025.3537087
PMID: 40031258
Abstract

Reinforcement learning (RL) has shown remarkable success in solving complex decision-making and control tasks. However, many model-free RL algorithms experience performance degradation due to inaccurate value estimation, particularly the overestimation of Q-values, which can lead to suboptimal policies. To address this issue, we previously proposed the Distributional Soft Actor-Critic (DSAC or DSACv1), an off-policy RL algorithm that enhances value estimation accuracy by learning a continuous Gaussian value distribution. Despite its effectiveness, DSACv1 faces challenges such as training instability and sensitivity to reward scaling, caused by high variance in critic gradients due to return randomness. In this paper, we introduce three key refinements to DSACv1 to overcome these limitations and further improve Q-value estimation accuracy: expected value substitution, twin value distribution learning, and variance-based critic gradient adjustment. The enhanced algorithm, termed DSAC with Three refinements (DSAC-T or DSACv2), is systematically evaluated across a diverse set of benchmark tasks. Without the need for task-specific hyperparameter tuning, DSAC-T consistently matches or outperforms leading model-free RL algorithms, including SAC, TD3, DDPG, TRPO, and PPO, in all tested environments. Additionally, DSAC-T ensures a stable learning process and maintains robust performance across varying reward scales. Its effectiveness is further demonstrated through real-world application in controlling a wheeled robot, highlighting its potential for deployment in practical robotic tasks.
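The three refinements named in the abstract lend themselves to a compact illustration. The numpy sketch below is an illustrative reading of the abstract only, not the paper's actual update rules: the function name, the Gaussian negative-log-likelihood-style loss, and the clipping bound `bound` are assumptions made for this example.

```python
import numpy as np

def dsac_t_style_critic_loss(reward, gamma, next_mu, twin_next_mu,
                             next_sigma, mu, sigma, bound=3.0, seed=0):
    """Per-sample Gaussian-critic loss sketching the three refinements."""
    rng = np.random.default_rng(seed)
    # Refinement 2: twin value distribution learning -- take the smaller
    # of the two critics' next-state means to curb Q-value overestimation.
    next_mean = min(next_mu, twin_next_mu)
    # Refinement 1: expected value substitution -- build the mean target
    # from the expected next value rather than a single sampled return.
    target_mu = reward + gamma * next_mean
    # A sampled return is still drawn so the variance head of the value
    # distribution sees the return randomness.
    target_sample = reward + gamma * (next_mean + next_sigma * rng.standard_normal())
    # Refinement 3: variance-based critic gradient adjustment -- clip the
    # TD error to a multiple of the predicted std so high return
    # randomness cannot inflate the critic gradient.
    td = np.clip(target_mu - mu, -bound * sigma, bound * sigma)
    # Gaussian negative-log-likelihood-style loss (constants dropped).
    loss = np.log(sigma) + (td**2 + (target_sample - target_mu)**2) / (2.0 * sigma**2)
    return float(loss)
```

Because the target uses the minimum of the two next-state means, swapping the two critics' predictions leaves the loss unchanged, and the clip keeps the loss finite even when the gap between target and prediction is large relative to `sigma`.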


Similar Articles

1. Distributional Soft Actor-Critic With Three Refinements.
IEEE Trans Pattern Anal Mach Intell. 2025 May;47(5):3935-3946. doi: 10.1109/TPAMI.2025.3537087. Epub 2025 Apr 8.

2. Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors.
IEEE Trans Neural Netw Learn Syst. 2022 Nov;33(11):6584-6598. doi: 10.1109/TNNLS.2021.3082568. Epub 2022 Oct 27.

3. Broad Critic Deep Actor Reinforcement Learning for Continuous Control.
IEEE Trans Neural Netw Learn Syst. 2025 Apr 8;PP. doi: 10.1109/TNNLS.2025.3554082.

4. Episodic Memory-Double Actor-Critic Twin Delayed Deep Deterministic Policy Gradient.
Neural Netw. 2025 Jul;187:107286. doi: 10.1016/j.neunet.2025.107286. Epub 2025 Feb 27.

5. Improved Soft Actor-Critic: Mixing Prioritized Off-Policy Samples With On-Policy Experiences.
IEEE Trans Neural Netw Learn Syst. 2024 Mar;35(3):3121-3129. doi: 10.1109/TNNLS.2022.3174051. Epub 2024 Feb 29.

6. Meta Attention for Off-Policy Actor-Critic.
Neural Netw. 2023 Jun;163:86-96. doi: 10.1016/j.neunet.2023.03.024. Epub 2023 Mar 28.

7. Stochastic Integrated Actor-Critic for Deep Reinforcement Learning.
IEEE Trans Neural Netw Learn Syst. 2024 May;35(5):6654-6666. doi: 10.1109/TNNLS.2022.3212273. Epub 2024 May 2.

8. The Actor-Dueling-Critic Method for Reinforcement Learning.
Sensors (Basel). 2019 Mar 30;19(7):1547. doi: 10.3390/s19071547.

9. Relative Importance Sampling for Off-Policy Actor-Critic in Deep Reinforcement Learning.
Sci Rep. 2025 Apr 24;15(1):14349. doi: 10.1038/s41598-025-96201-5.

10. Deep Reinforcement Learning-Based Accurate Control of Planetary Soft Landing.
Sensors (Basel). 2021 Dec 6;21(23):8161. doi: 10.3390/s21238161.