Zhang Tiantian, Lin Zichuan, Wang Yuxing, Ye Deheng, Fu Qiang, Yang Wei, Wang Xueqian, Liang Bin, Yuan Bo, Li Xiu
IEEE Trans Neural Netw Learn Syst. 2024 Oct;35(10):14588-14602. doi: 10.1109/TNNLS.2023.3280085. Epub 2024 Oct 7.
A key challenge of continual reinforcement learning (CRL) in dynamic environments is to promptly adapt the reinforcement learning (RL) agent's behavior as the environment changes over its lifetime while minimizing catastrophic forgetting of the learned information. To address this challenge, we propose dynamics-adaptive continual RL (DaCoRL). DaCoRL learns a context-conditioned policy using progressive contextualization, which incrementally clusters a stream of stationary tasks in the dynamic environment into a series of contexts and uses an expandable multihead neural network to approximate the policy. Specifically, we define a set of tasks with similar dynamics as an environmental context and formalize context inference as online Bayesian infinite Gaussian mixture clustering on environment features, using online Bayesian inference to infer the posterior distribution over contexts. Under the assumption of a Chinese restaurant process (CRP) prior, this technique can accurately classify the current task as a previously seen context or instantiate a new context as needed, without relying on any external indicator to signal environmental changes in advance. Furthermore, we employ an expandable multihead neural network whose output layer is synchronously expanded with each newly instantiated context, together with a knowledge distillation regularization term for retaining performance on learned tasks. As a general framework that can be coupled with various deep RL algorithms, DaCoRL consistently outperforms existing methods in terms of stability, overall performance, and generalization ability, as verified by extensive experiments on several robot navigation and MuJoCo locomotion tasks.
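The CRP-prior context inference described above can be illustrated with a minimal sketch. This is not the paper's implementation: the class name, the isotropic Gaussian likelihood with known variance, and the hard (maximum a posteriori) assignment are all simplifying assumptions made here for illustration. Each context keeps a running mean of its environment features; a new task either joins the highest-scoring existing context or spawns a new one, with no external change signal.

```python
import numpy as np

class CRPContextInference:
    """Illustrative online context clustering under a CRP prior.

    Existing context k attracts a new task with prior mass c_k / (n + alpha);
    a brand-new context has mass alpha / (n + alpha). The prior is combined
    with an isotropic Gaussian likelihood over environment features.
    """

    def __init__(self, alpha=1.0, sigma=1.0, sigma0=3.0):
        self.alpha = alpha    # CRP concentration: propensity to open new contexts
        self.sigma = sigma    # within-context feature std (assumed known here)
        self.sigma0 = sigma0  # broad predictive std for a brand-new context
        self.means = []       # running feature mean per context
        self.counts = []      # number of tasks assigned per context

    def _log_gauss(self, x, mu, std):
        d = x - mu
        return -0.5 * np.dot(d, d) / std**2 - x.size * np.log(std)

    def infer(self, x):
        """Return the context index for feature vector x, creating one if needed."""
        x = np.asarray(x, dtype=float)
        n = sum(self.counts)
        scores = [np.log(c / (n + self.alpha)) + self._log_gauss(x, mu, self.sigma)
                  for mu, c in zip(self.means, self.counts)]
        # Score for instantiating a new context (CRP "new table" term).
        scores.append(np.log(self.alpha / (n + self.alpha))
                      + self._log_gauss(x, np.zeros_like(x), self.sigma0))
        k = int(np.argmax(scores))
        if k == len(self.means):              # instantiate a new context
            self.means.append(x.copy())
            self.counts.append(1)
        else:                                 # online update of the running mean
            self.counts[k] += 1
            self.means[k] += (x - self.means[k]) / self.counts[k]
        return k
```

With this sketch, two tasks with nearby features fall into one context, while a task with very different features instantiates a second context, mirroring the behavior the abstract attributes to the CRP-based inference.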
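The expandable multihead architecture and distillation regularizer can likewise be sketched in miniature. The shapes, the single-layer trunk, and the MSE-based distillation term below are illustrative assumptions, not the paper's network: the point is only that the output layer grows one head per context, and frozen snapshots of earlier heads anchor behavior on learned tasks.

```python
import numpy as np

class MultiHeadPolicy:
    """Illustrative expandable multihead policy with a distillation penalty.

    A shared trunk feeds one output head per context; snapshot() freezes
    the current heads so later training can be regularized toward them.
    """

    def __init__(self, obs_dim, act_dim, hidden=32, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = self.rng.standard_normal((obs_dim, hidden)) * 0.1  # shared trunk
        self.heads = []      # one output layer per context
        self.frozen = []     # frozen copies of old heads for distillation
        self.act_dim = act_dim

    def add_head(self):
        """Expand the output layer when a new context is instantiated."""
        hidden = self.W.shape[1]
        self.heads.append(self.rng.standard_normal((hidden, self.act_dim)) * 0.1)
        return len(self.heads) - 1

    def forward(self, obs, k):
        """Policy output of head k for a batch of observations."""
        h = np.tanh(obs @ self.W)
        return h @ self.heads[k]

    def snapshot(self):
        """Freeze current heads; their outputs define the distillation targets."""
        self.frozen = [head.copy() for head in self.heads]

    def distill_loss(self, obs):
        """Mean squared gap between current and frozen head outputs on obs."""
        h = np.tanh(obs @ self.W)
        return sum(np.mean((h @ new - h @ old) ** 2)
                   for new, old in zip(self.heads, self.frozen))
```

In use, `add_head` would be called whenever the context-inference step instantiates a new context, and `distill_loss` would be added to the RL objective while training on the new context.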