Jiang Peng, Song Shiji, Huang Gao
IEEE Trans Neural Netw Learn Syst. 2023 Aug;34(8):4033-4046. doi: 10.1109/TNNLS.2021.3121432. Epub 2023 Aug 4.
Meta reinforcement learning (meta-RL) is a promising technique for fast task adaptation that leverages prior knowledge from previous tasks. Recently, context-based meta-RL has been proposed to improve data efficiency through a principled framework that divides the learning procedure into task inference and task execution. However, this approach does not adequately leverage task information, which leads to inefficient exploration. To address this problem, we propose a novel context-based meta-RL framework with an improved exploration mechanism. For the exploration-execution problem in existing context-based meta-RL, we propose a novel objective with two exploration terms that encourage better exploration in the action space and the task-embedding space, respectively. The first term promotes diversity in task inference, while the second term, named action information, serves to share or hide task information in different exploration stages. According to how action information is used, we divide the meta-training procedure into a task-independent exploration stage and a task-relevant exploration stage. By decoupling task inference from task execution and proposing a separate optimization objective for each exploration stage, we can efficiently learn the policy and task-inference networks. We compare our algorithm with several popular meta-RL methods on MuJoCo benchmarks under both dense- and sparse-reward settings. The empirical results show that our method significantly outperforms the baselines in both sample efficiency and task performance.
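The abstract only sketches the two-term objective, so the following minimal PyTorch snippet illustrates one plausible reading of it; the function name, the pairwise-distance surrogate for task-inference diversity, and the KL-divergence proxy for "action information" are our illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def exploration_bonus(z, pi_logits_task, pi_logits_free, stage):
    """Illustrative two-term exploration bonus (names hypothetical).

    z:              (B, D) task embeddings from the inference network
    pi_logits_task: (B, A) action logits of the task-conditioned policy
    pi_logits_free: (B, A) action logits of a task-agnostic policy
    stage:          'task_independent' or 'task_relevant'
    """
    # Term 1: encourage diverse task embeddings. Here a simple mean
    # pairwise-distance surrogate; the paper's exact diversity measure
    # is not specified in the abstract.
    diversity = torch.pdist(z).mean()

    # Term 2: "action information" -- how much the task embedding
    # influences the action distribution, modeled here as the KL
    # divergence between the task-agnostic and task-conditioned policies.
    log_p_task = F.log_softmax(pi_logits_task, dim=-1)
    p_free = F.softmax(pi_logits_free, dim=-1)
    action_info = F.kl_div(log_p_task, p_free, reduction="batchmean")

    if stage == "task_independent":
        # Hide task information: penalize dependence of actions on the
        # task embedding so exploration does not commit to a task.
        return diversity - action_info
    # Task-relevant stage: share task information, rewarding actions
    # that exploit the inferred task.
    return diversity + action_info

# Hypothetical usage with random tensors (batch of 8, 5-dim embeddings,
# 4 discrete actions):
z = torch.randn(8, 5)
task_logits = torch.randn(8, 4)
free_logits = torch.randn(8, 4)
print(exploration_bonus(z, task_logits, free_logits, "task_independent"))

Switching the sign of the action-information term between the two stages is what lets a single objective first drive broad, task-agnostic exploration and then task-directed exploitation, matching the staged meta-training the abstract describes.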