Zeng Hongliang, Zhang Ping, Li Fang, Lin Chubin, Zhou Junkang
IEEE Trans Neural Netw Learn Syst. 2024 Nov;35(11):16602-16615. doi: 10.1109/TNNLS.2023.3296765. Epub 2024 Oct 29.
With shaped reward functions, reinforcement learning (RL) has recently been applied successfully to several robot control tasks. However, designing a task-relevant, well-performing reward function takes time and effort. Training an agent to complete a task in a sparse-reward environment would sidestep the difficulty of reward design, but doing so remains a significant challenge. To address this issue, the pioneering hindsight experience replay (HER) method dramatically enhances the probability of acquiring skills in sparse-reward environments by transforming unsuccessful experiences into helpful training samples. However, HER still requires a lengthy training period. In this article, we propose a new HER-based technique, termed adaptive HER with goal-amended curiosity module (AHEGC), to further enhance sample and exploration efficiency. Specifically, an adaptive adjustment strategy for the hindsight experience (HE) sampling rate and reward weights is developed to enhance sample efficiency. Furthermore, we introduce a curiosity mechanism to encourage more efficient exploration of the environment and propose a goal-amended (GA) curiosity module to counter the over-seeking of novelty that the introduced curiosity can cause. We conducted experiments on six demanding robot control tasks with binary rewards, including Fetch and Hand environments. The results show that the proposed method outperforms existing methods in both learning ability and convergence speed.
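To make the HER idea concrete, the following is a minimal sketch (not the authors' implementation) of HER's standard "future" relabeling strategy: each transition in an episode is stored once with its original goal and again with substitute goals drawn from achieved goals later in the same episode, so that failed episodes still yield rewarded samples. The transition fields and the sparse reward convention (0 on success, -1 otherwise) are assumptions modeled on the Fetch/Hand environments.

```python
import random

def her_relabel(episode, reward_fn, k=4):
    """HER 'future' relabeling sketch (assumed transition layout).

    episode: list of dicts with keys 'obs', 'action', 'achieved_goal', 'goal'.
    reward_fn(achieved, goal): sparse reward, e.g. 0.0 on success, -1.0 otherwise.
    Returns the original transitions plus k relabeled copies of each.
    """
    out = []
    T = len(episode)
    for t, tr in enumerate(episode):
        # keep the original transition with its true (usually unreached) goal
        out.append(dict(tr, reward=reward_fn(tr['achieved_goal'], tr['goal'])))
        # sample k achieved goals from the remainder of the episode
        for _ in range(k):
            i = random.randint(t, T - 1)
            new_goal = episode[i]['achieved_goal']
            # relabeled copy: the goal is replaced, so reward may now be 0
            out.append(dict(tr, goal=new_goal,
                            reward=reward_fn(tr['achieved_goal'], new_goal)))
    return out
```

Because at least one sampled future goal can coincide with the transition's own achieved goal, relabeled batches contain successful (reward 0) samples even when every original episode failed, which is the mechanism that densifies learning signal under binary rewards.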