Osa Takayuki, Tangkaratt Voot, Sugiyama Masashi
Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu, Kita-kyushu, 808-0135, Fukuoka, Japan; RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, 103-0027, Tokyo, Japan.
RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, 103-0027, Tokyo, Japan.
Neural Netw. 2022 Aug;152:90-104. doi: 10.1016/j.neunet.2022.04.009. Epub 2022 Apr 16.
Reinforcement learning algorithms are typically limited to learning a single solution for a specified task, even though diverse solutions often exist. Recent studies have shown that learning a set of diverse solutions is beneficial because diversity enables robust few-shot adaptation. Although existing methods learn diverse solutions by using the mutual information as unsupervised rewards, such an approach often suffers from the bias of the gradient estimator induced by value function approximation. In this study, we propose a novel method that can learn diverse solutions without suffering from this bias problem. In our method, a policy conditioned on a continuous or discrete latent variable is trained by directly maximizing the variational lower bound of the mutual information, instead of using the mutual information as unsupervised rewards as in previous studies. Through extensive experiments on robot locomotion tasks, we demonstrate that the proposed method successfully learns an infinite set of diverse solutions by learning continuous latent variables, which is more challenging than learning a finite number of solutions. Subsequently, we show that our method enables more effective few-shot adaptation compared with existing methods.
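For context, the variational lower bound of the mutual information that the abstract refers to is presumably the standard Barber–Agakov bound; the notation below (s for states visited by the latent-conditioned policy, z for the latent variable, and a learned variational posterior q_phi) is an illustrative assumption, not taken from the paper itself:

\begin{align}
I(S; Z) &= H(Z) - H(Z \mid S) \\
        &\ge H(Z) + \mathbb{E}_{p(s, z)}\!\left[\log q_{\phi}(z \mid s)\right],
\end{align}

where the inequality holds because $\mathrm{KL}\!\left(p(z \mid s) \,\|\, q_{\phi}(z \mid s)\right) \ge 0$. Maximizing the right-hand side directly with respect to the policy and $q_{\phi}$, rather than feeding an estimate of it back into the return as an unsupervised reward, is the distinction the abstract draws from prior mutual-information-based diversity methods.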