Improving Offline Reinforcement Learning With In-Sample Advantage Regularization for Robot Manipulation

Authors

Ma Chengzhong, Yang Deyu, Wu Tianyu, Liu Zeyang, Yang Houxue, Chen Xingyu, Lan Xuguang, Zheng Nanning

Publication Information

IEEE Trans Neural Netw Learn Syst. 2024 Sep 20;PP. doi: 10.1109/TNNLS.2024.3443102.

Abstract

Offline reinforcement learning (RL) aims to learn the best possible policy from a fixed dataset without real-time interaction with the environment. By avoiding risky exploration on the robot, this approach is expected to significantly improve the robot's learning efficiency and safety. However, because value estimates for out-of-distribution actions are prone to error, most offline RL algorithms constrain or regularize the policy toward the actions contained in the dataset. The cost of such methods is the introduction of new hyperparameters and additional complexity. In this article, we aim to adapt offline RL to robotic manipulation with minimal changes and to avoid evaluating out-of-distribution actions as much as possible. To this end, we improve offline RL with in-sample advantage regularization (ISAR). To mitigate the impact of unseen actions, ISAR learns the state-value function using only dataset samples to regress the optimal action-value function. Our method computes the advantage of state-action pairs from this in-sample value estimate and adds a behavior cloning (BC) regularization term to the policy update. This improves sample efficiency with minimal changes, yielding a simple and easy-to-implement method. Experiments on the D4RL robot benchmark and on multigoal sparse-reward robotic tasks show that ISAR achieves performance comparable to current state-of-the-art algorithms without complex parameter tuning or excessive training time. In addition, we demonstrate the effectiveness of our method on a real-world robot platform.
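The abstract leaves the loss functions implicit, so the following is a minimal sketch of the two ingredients it names: learning the state-value function from dataset samples only (here via an IQL-style expectile regression toward the learned Q-function), and adding an advantage-weighted BC regularization term to the policy update. The network sizes, the expectile `tau`, the BC weight `alpha`, and the exponential advantage weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, out_dim, hidden=256):
    """Two-hidden-layer MLP used for every network in this sketch."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class ISARSketch:
    """Illustrative trainer, not the authors' reference implementation."""

    def __init__(self, obs_dim, act_dim, tau=0.7, alpha=2.5, gamma=0.99):
        self.q = mlp(obs_dim + act_dim, 1)   # action-value network Q(s, a)
        self.v = mlp(obs_dim, 1)             # state-value network V(s)
        self.pi = mlp(obs_dim, act_dim)      # deterministic policy network
        self.tau, self.alpha, self.gamma = tau, alpha, gamma
        self.opt_q = torch.optim.Adam(self.q.parameters(), lr=3e-4)
        self.opt_v = torch.optim.Adam(self.v.parameters(), lr=3e-4)
        self.opt_pi = torch.optim.Adam(self.pi.parameters(), lr=3e-4)

    def update(self, s, a, r, s_next, done):
        """One gradient step on a minibatch from the fixed offline dataset.
        `done` is a float mask in {0, 1}; actions are assumed in [-1, 1]."""
        # Value step: fit V(s) toward an upper expectile of Q(s, a) using only
        # in-sample (s, a) pairs, so no out-of-distribution action is evaluated.
        with torch.no_grad():
            q_sa = self.q(torch.cat([s, a], dim=-1))
        diff = q_sa - self.v(s)
        weight = torch.abs(self.tau - (diff < 0).float())  # asymmetric expectile weight
        v_loss = (weight * diff.pow(2)).mean()
        self.opt_v.zero_grad(); v_loss.backward(); self.opt_v.step()

        # Q step: TD backup that bootstraps through V(s') instead of
        # max_a' Q(s', a'), again avoiding unseen actions.
        with torch.no_grad():
            target = r + self.gamma * (1.0 - done) * self.v(s_next)
        q_loss = F.mse_loss(self.q(torch.cat([s, a], dim=-1)), target)
        self.opt_q.zero_grad(); q_loss.backward(); self.opt_q.step()

        # Policy step: maximize Q under the policy plus a BC regularization term
        # whose per-sample weight grows with the in-sample advantage A(s, a).
        with torch.no_grad():
            adv = q_sa - self.v(s)                    # in-sample advantage estimate
            bc_w = torch.exp(adv).clamp(max=100.0)    # upweight high-advantage dataset actions
        pi_a = torch.tanh(self.pi(s))
        q_pi = self.q(torch.cat([s, pi_a], dim=-1))
        bc_term = (bc_w * (pi_a - a).pow(2).sum(dim=-1, keepdim=True)).mean()
        pi_loss = -q_pi.mean() + self.alpha * bc_term
        self.opt_pi.zero_grad(); pi_loss.backward(); self.opt_pi.step()
        return v_loss.item(), q_loss.item(), pi_loss.item()
```

In practice, `s, a, r, s_next, done` would be minibatches sampled from the fixed offline buffer (e.g., a D4RL dataset), and the three updates would be repeated for a fixed number of gradient steps before evaluation.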

