School of Information Technology, Deakin University, Geelong 3220, Australia.
School of Computer Science and Engineering, University of New South Wales, Sydney 2052, Australia.
Sensors (Basel). 2023 Mar 1;23(5):2681. doi: 10.3390/s23052681.
Deep Reinforcement Learning (DeepRL) methods have been widely used in robotics to learn about the environment and acquire behaviours autonomously. Deep Interactive Reinforcement Learning (DeepIRL) includes interactive feedback from an external trainer or expert who gives advice to help learners choose actions, speeding up the learning process. However, current research has been limited to interactions that offer actionable advice for only the agent's current state. Additionally, the agent discards this information after a single use, causing the same process to be repeated when the state is revisited. In this paper, we present Broad-Persistent Advising (BPA), an approach that retains and reuses the processed information. It not only helps trainers give more general advice relevant to similar states, rather than only the current state, but also allows the agent to speed up the learning process. We tested the proposed approach in two continuous robotic scenarios: a cart-pole balancing task and a simulated robot navigation task. The results showed that the agent learned faster, with rewards increasing by up to 37%, while the number of interactions required of the trainer remained the same as in the DeepIRL approach.