Wang Zhi, Chen Chunlin, Dong Daoyi
IEEE Trans Cybern. 2023 Dec;53(12):7509-7520. doi: 10.1109/TCYB.2022.3170485. Epub 2023 Nov 29.
While reinforcement learning (RL) algorithms are achieving state-of-the-art performance in various challenging tasks, they can easily encounter catastrophic forgetting or interference when faced with lifelong streaming information. In this article, we propose a scalable lifelong RL method that dynamically expands the network capacity to accommodate new knowledge while preventing past memories from being perturbed. We use a Dirichlet process mixture to model the nonstationary task distribution, which captures task relatedness by estimating the likelihood of task-to-cluster assignments and clusters the task models in a latent space. We formulate the prior distribution of the mixture as a Chinese restaurant process (CRP) that instantiates new mixture components as needed. The update and expansion of the mixture are governed by the Bayesian nonparametric framework with an expectation maximization (EM) procedure, which dynamically adapts the model complexity without explicit task boundaries or heuristics. Moreover, we use the domain randomization technique to train robust prior parameters for the initialization of each task model in the mixture; thus, the resulting model can better generalize and adapt to unseen tasks. With extensive experiments conducted on robot navigation and locomotion domains, we show that our method successfully facilitates scalable lifelong RL and outperforms relevant existing methods.
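The abstract's Chinese restaurant process (CRP) prior can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows the generic CRP seating rule the paper builds on: each incoming task joins an existing cluster with probability proportional to the cluster's size, or instantiates a new mixture component with probability proportional to a concentration parameter `alpha` (both names are assumptions for illustration).

```python
import random

def crp_assignments(n_tasks, alpha, seed=0):
    """Sample cluster assignments for n_tasks under a Chinese restaurant
    process with concentration alpha.  Task t joins existing cluster k
    with probability counts[k] / (t + alpha), or opens a new cluster
    with probability alpha / (t + alpha)."""
    rng = random.Random(seed)
    counts = []        # counts[k] = number of tasks already in cluster k
    assignments = []
    for _ in range(n_tasks):
        weights = counts + [alpha]   # existing clusters, then a new one
        r = rng.random() * sum(weights)
        k = 0
        while r >= weights[k]:       # categorical draw over the weights
            r -= weights[k]
            k += 1
        if k == len(counts):         # instantiate a new mixture component
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

print(crp_assignments(10, alpha=1.0))
```

The key property for lifelong learning is visible here: the number of clusters is not fixed in advance but grows with the data, which is what lets the mixture expand its capacity as novel tasks arrive.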