Cao Junjie, Liu Weiwei, Liu Yong, Yang Jian
Institute of Cyber Systems and Control, Zhejiang University, Hangzhou, China.
China Research and Development Academy of Machinery Equipment, Beijing, China.
Front Neurorobot. 2020 Apr 21;14:21. doi: 10.3389/fnbot.2020.00021. eCollection 2020.
There has been substantial growth in research on robot automation, which aims to make robots capable of interacting directly with the world or with humans. Robot learning from human demonstration is central to this goal. However, dependence on demonstrations restricts the robot to a fixed scenario, without the ability to explore varied situations to accomplish the same task shown in the demonstration. Deep reinforcement learning offers a way for robots to learn beyond human demonstration and to fulfill the task in unknown situations, and exploration is at the core of such generalization to different environments. Exploration in reinforcement learning, however, can be ineffective and suffers from low sample efficiency. In this paper, we present Evolutionary Policy Gradient (EPG) to enable a robot to learn from demonstration and perform goal-oriented exploration efficiently. Through goal-oriented exploration, our method can generalize a learned skill to environments with different parameters. EPG combines parameter perturbation with policy gradient methods in the framework of Evolutionary Algorithms (EAs), fusing the benefits of both to achieve effective and efficient exploration. With demonstrations guiding the evolutionary process, the robot can accelerate goal-oriented exploration and generalize its capability to varied scenarios. Experiments on robot control tasks in OpenAI Gym with dense and sparse rewards show that EPG provides competitive performance compared with the original policy gradient methods and EAs. In the manipulation task, our robot learns to open a door with vision in environments that differ from those in which the demonstrations were provided.
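To make the combination of EA-style perturbation, policy gradients, and demonstration guidance concrete, the following is a minimal sketch of one possible EPG-style loop, not the authors' implementation. Here episode_return, numeric_grad, and theta_demo are hypothetical stand-ins: episode_return would really be a policy rollout in an environment such as OpenAI Gym, the finite-difference gradient stands in for an actual policy gradient update, and the elitist selection scheme is one illustrative choice among many.

import numpy as np

rng = np.random.default_rng(0)
DIM, POP, GENS = 8, 16, 100            # parameter dim, population size, generations
SIGMA, LR, DEMO_PULL = 0.1, 0.05, 0.05 # perturbation scale, step size, demo weight

# Hypothetical demonstration parameters, e.g. obtained by behavioral
# cloning of human data. Here it is just a fixed target vector.
theta_demo = rng.normal(size=DIM)

def episode_return(theta):
    """Stand-in fitness: a real implementation would roll out the policy
    parameterized by theta in the environment and return the episode return."""
    return -float(np.sum((theta - theta_demo) ** 2))

def numeric_grad(f, theta, eps=1e-4):
    """Finite-difference gradient of f, standing in for a policy gradient."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

population = [rng.normal(size=DIM) for _ in range(POP)]
for gen in range(GENS):
    children = []
    for theta in population:
        child = theta + SIGMA * rng.normal(size=DIM)              # EA-style parameter perturbation
        child = child + LR * numeric_grad(episode_return, child)  # gradient ascent on the return
        child = child + DEMO_PULL * (theta_demo - child)          # demonstration guidance
        children.append(child)
    # Elitist selection: keep the POP fittest members of parents + children.
    merged = population + children
    merged.sort(key=episode_return, reverse=True)
    population = merged[:POP]

print(f"best return after {GENS} generations: {episode_return(population[0]):.4f}")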