


Improving Offline Reinforcement Learning With In-Sample Advantage Regularization for Robot Manipulation.

Authors

Ma Chengzhong, Yang Deyu, Wu Tianyu, Liu Zeyang, Yang Houxue, Chen Xingyu, Lan Xuguang, Zheng Nanning

Publication

IEEE Trans Neural Netw Learn Syst. 2024 Sep 20;PP. doi: 10.1109/TNNLS.2024.3443102.

DOI: 10.1109/TNNLS.2024.3443102
PMID: 39302799
Abstract

Offline reinforcement learning (RL) aims to learn the possible policy from a fixed dataset without real-time interactions with the environment. By avoiding the risky exploration of the robot, this approach is expected to significantly improve the robot's learning efficiency and safety. However, due to errors in value estimation from out-of-distribution actions, most offline RL algorithms constrain or regularize the policy to the actions contained within the dataset. The cost of such methods is the introduction of new hyperparameters and additional complexity. In this article, we aim to adapt offline RL to robotic manipulation with minimal changes and to avoid evaluating out-of-distribution actions as much as possible. Therefore, we improve offline RL with in-sample advantage regularization (ISAR). To mitigate the impact of unseen actions, the ISAR learns the state-value function only with the dataset sample to regress the optimal action-value function. Our method calculates the advantage function of action-state pairs based on in-sample value estimation and adds a behavior cloning (BC) regularization term in the policy update. This improves sample efficiency with minimal changes, resulting in a simple and easy-to-implement method. The experiments of the D4RL robot benchmark and multigoal sparse rewards robotic tasks show that the ISAR achieves excellent performance comparable to current state-of-the-art algorithms without the need for complex parameter tuning and too much training time. In addition, we demonstrate the effectiveness of our method on a real-world robot platform.
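The two ingredients the abstract describes, learning a state-value function only from in-sample actions and then using the resulting advantage to weight a behavior-cloning term in the policy update, can be sketched roughly as below. This is an illustrative sketch only, assuming an IQL-style expectile objective for the in-sample value function and an exponential advantage weight; the function names and the `tau`, `beta`, and `max_weight` parameters are hypothetical choices, not taken from the paper.

```python
import numpy as np

def expectile_loss(diff, tau=0.7):
    # Asymmetric L2 loss for in-sample value learning: with tau > 0.5,
    # positive errors (Q above V) are over-weighted, so V regresses toward
    # an upper expectile of Q using only dataset actions -- no
    # out-of-distribution action is ever evaluated.
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return weight * diff ** 2

def advantage_weights(q, v, beta=3.0, max_weight=100.0):
    # Advantage of in-sample (s, a) pairs, exponentiated and clipped.
    # These weights scale a behavior-cloning term in the policy update,
    # so the policy imitates dataset actions in proportion to how much
    # better than average they are estimated to be.
    adv = q - v
    return np.minimum(np.exp(beta * adv), max_weight)

# Toy batch of in-sample Q(s, a) and V(s) estimates.
q = np.array([1.0, 0.5, -0.2])
v = np.array([0.8, 0.6, 0.1])
w = advantage_weights(q, v)
# Policy step (conceptually): minimize  w * ||pi(s) - a||^2  over the batch,
# i.e., a BC regularizer weighted by the in-sample advantage.
```

The appeal of this family of methods is that both pieces are supervised-style losses on dataset samples, which is why the abstract can claim minimal changes, easy implementation, and no evaluation of unseen actions.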


Similar Articles

1. Improving Offline Reinforcement Learning With In-Sample Advantage Regularization for Robot Manipulation.
   IEEE Trans Neural Netw Learn Syst. 2024 Sep 20;PP. doi: 10.1109/TNNLS.2024.3443102.
2. Offline Reinforcement Learning With Behavior Value Regularization.
   IEEE Trans Cybern. 2024 Jun;54(6):3692-3704. doi: 10.1109/TCYB.2024.3385910. Epub 2024 May 30.
3. Adaptive pessimism via target Q-value for offline reinforcement learning.
   Neural Netw. 2024 Dec;180:106588. doi: 10.1016/j.neunet.2024.106588. Epub 2024 Aug 5.
4. Mild Policy Evaluation for Offline Actor-Critic.
   IEEE Trans Neural Netw Learn Syst. 2024 Dec;35(12):17950-17964. doi: 10.1109/TNNLS.2023.3309906. Epub 2024 Dec 2.
5. Relative Entropy Regularized Sample-Efficient Reinforcement Learning With Continuous Actions.
   IEEE Trans Neural Netw Learn Syst. 2025 Jan;36(1):475-485. doi: 10.1109/TNNLS.2023.3329513. Epub 2025 Jan 7.
6. Efficient Offline Reinforcement Learning With Relaxed Conservatism.
   IEEE Trans Pattern Anal Mach Intell. 2024 Aug;46(8):5260-5272. doi: 10.1109/TPAMI.2024.3364844. Epub 2024 Jul 2.
7. Monotonic Quantile Network for Worst-Case Offline Reinforcement Learning.
   IEEE Trans Neural Netw Learn Syst. 2024 Jul;35(7):8954-8968. doi: 10.1109/TNNLS.2022.3217189. Epub 2024 Jul 8.
8. False Correlation Reduction for Offline Reinforcement Learning.
   IEEE Trans Pattern Anal Mach Intell. 2024 Feb;46(2):1199-1211. doi: 10.1109/TPAMI.2023.3328397. Epub 2024 Jan 8.
9. De-Pessimism Offline Reinforcement Learning via Value Compensation.
   IEEE Trans Neural Netw Learn Syst. 2024 Aug 23;PP. doi: 10.1109/TNNLS.2024.3443082.
10. Human skill knowledge guided global trajectory policy reinforcement learning method.
   Front Neurorobot. 2024 Mar 15;18:1368243. doi: 10.3389/fnbot.2024.1368243. eCollection 2024.