


Improving Offline Reinforcement Learning With In-Sample Advantage Regularization for Robot Manipulation.

Authors

Ma Chengzhong, Yang Deyu, Wu Tianyu, Liu Zeyang, Yang Houxue, Chen Xingyu, Lan Xuguang, Zheng Nanning

Publication

IEEE Trans Neural Netw Learn Syst. 2024 Sep 20;PP. doi: 10.1109/TNNLS.2024.3443102.

DOI: 10.1109/TNNLS.2024.3443102
PMID: 39302799
Abstract

Offline reinforcement learning (RL) aims to learn the possible policy from a fixed dataset without real-time interactions with the environment. By avoiding the risky exploration of the robot, this approach is expected to significantly improve the robot's learning efficiency and safety. However, due to errors in value estimation from out-of-distribution actions, most offline RL algorithms constrain or regularize the policy to the actions contained within the dataset. The cost of such methods is the introduction of new hyperparameters and additional complexity. In this article, we aim to adapt offline RL to robotic manipulation with minimal changes and to avoid evaluating out-of-distribution actions as much as possible. Therefore, we improve offline RL with in-sample advantage regularization (ISAR). To mitigate the impact of unseen actions, the ISAR learns the state-value function only with the dataset sample to regress the optimal action-value function. Our method calculates the advantage function of action-state pairs based on in-sample value estimation and adds a behavior cloning (BC) regularization term in the policy update. This improves sample efficiency with minimal changes, resulting in a simple and easy-to-implement method. The experiments of the D4RL robot benchmark and multigoal sparse rewards robotic tasks show that the ISAR achieves excellent performance comparable to current state-of-the-art algorithms without the need for complex parameter tuning and too much training time. In addition, we demonstrate the effectiveness of our method on a real-world robot platform.
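The two ingredients the abstract describes, learning a state-value function only from in-sample actions and then using the resulting advantage to weight a behavior-cloning term in the policy update, can be sketched roughly as below. This is an illustrative sketch only, assuming an IQL-style expectile objective for the in-sample value function and an exponential advantage weight; the function names and the `tau`, `beta`, and `max_weight` parameters are hypothetical choices, not taken from the paper.

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    # Asymmetric L2 loss for in-sample value learning: with tau > 0.5,
    # positive errors (Q above V) are over-weighted, so V regresses toward
    # an upper expectile of Q using only dataset actions -- no
    # out-of-distribution action is ever evaluated.
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2

def advantage_weights(q, v, beta=3.0, max_weight=100.0):
    # Advantage of in-sample (s, a) pairs, exponentiated and clipped.
    # These weights scale a behavior-cloning term in the policy update,
    # so the policy imitates dataset actions in proportion to how much
    # better than average they are estimated to be.
    adv = q - v
    return np.minimum(np.exp(beta * adv), max_weight)

# Toy batch of in-sample Q(s, a) and V(s) estimates.
q = np.array([1.0, 0.5, -0.2])
v = np.array([0.8, 0.6, 0.1])
w = advantage_weights(q, v)
# Policy step (conceptually): minimize  w * ||pi(s) - a||^2  over the batch,
# i.e., a BC regularizer weighted by the in-sample advantage.
```

The appeal of this family of methods is that both pieces are supervised-style losses on dataset samples, which is why the abstract can claim minimal changes, easy implementation, and no evaluation of unseen actions.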


Similar Articles

1. Improving Offline Reinforcement Learning With In-Sample Advantage Regularization for Robot Manipulation.
   IEEE Trans Neural Netw Learn Syst. 2024 Sep 20;PP. doi: 10.1109/TNNLS.2024.3443102.
2. Offline Reinforcement Learning With Behavior Value Regularization.
   IEEE Trans Cybern. 2024 Jun;54(6):3692-3704. doi: 10.1109/TCYB.2024.3385910. Epub 2024 May 30.
3. Adaptive pessimism via target Q-value for offline reinforcement learning.
   Neural Netw. 2024 Dec;180:106588. doi: 10.1016/j.neunet.2024.106588. Epub 2024 Aug 5.
4. Mild Policy Evaluation for Offline Actor-Critic.
   IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17950-17964. doi: 10.1109/TNNLS.2023.3309906. Epub 2024 Dec 2.
5. Relative Entropy Regularized Sample-Efficient Reinforcement Learning With Continuous Actions.
   IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):475-485. doi: 10.1109/TNNLS.2023.3329513. Epub 2025 Jan 7.
6. Efficient Offline Reinforcement Learning With Relaxed Conservatism.
   IEEE Trans Pattern Anal Mach Intell. 2024 Aug;46(8):5260-5272. doi: 10.1109/TPAMI.2024.3364844. Epub 2024 Jul 2.
7. Monotonic Quantile Network for Worst-Case Offline Reinforcement Learning.
   IEEE Trans Neural Netw Learn Syst. 2024 Jul;35(7):8954-8968. doi: 10.1109/TNNLS.2022.3217189. Epub 2024 Jul 8.
8. False Correlation Reduction for Offline Reinforcement Learning.
   IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1199-1211. doi: 10.1109/TPAMI.2023.3328397. Epub 2024 Jan 8.
9. De-Pessimism Offline Reinforcement Learning via Value Compensation.
   IEEE Trans Neural Netw Learn Syst. 2024 Aug 23;PP. doi: 10.1109/TNNLS.2024.3443082.
10. Human skill knowledge guided global trajectory policy reinforcement learning method.
   Front Neurorobot. 2024 Mar 15;18:1368243. doi: 10.3389/fnbot.2024.1368243. eCollection 2024.