Osa Takayuki, Tangkaratt Voot, Sugiyama Masashi
Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu, Kita-kyushu, 808-0135, Fukuoka, Japan; RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, 103-0027, Tokyo, Japan.
RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, 103-0027, Tokyo, Japan.
Neural Netw. 2022 Aug;152:90-104. doi: 10.1016/j.neunet.2022.04.009. Epub 2022 Apr 16.
Reinforcement learning algorithms are typically limited to learning a single solution for a specified task, even though diverse solutions often exist. Recent studies have shown that learning a set of diverse solutions is beneficial because diversity enables robust few-shot adaptation. Although existing methods learn diverse solutions by using the mutual information as unsupervised rewards, such an approach often suffers from the bias of the gradient estimator induced by value function approximation. In this study, we propose a novel method that can learn diverse solutions without suffering from this bias problem. In our method, a policy conditioned on a continuous or discrete latent variable is trained by directly maximizing the variational lower bound of the mutual information, instead of using the mutual information as unsupervised rewards as in previous studies. Through extensive experiments on robot locomotion tasks, we demonstrate that the proposed method successfully learns an infinite set of diverse solutions by learning continuous latent variables, which is more challenging than learning a finite number of solutions. Subsequently, we show that our method enables more effective few-shot adaptation compared with existing methods.
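For context, the variational lower bound of the mutual information that the abstract refers to is presumably the standard Barber–Agakov bound; the notation below (s for states visited by the latent-conditioned policy, z for the latent variable, and a learned variational posterior q_phi) is an illustrative assumption, not taken from the paper itself:

\begin{align}
I(S; Z) &= H(Z) - H(Z \mid S) \\
        &\ge H(Z) + \mathbb{E}_{p(s, z)}\!\left[\log q_{\phi}(z \mid s)\right],
\end{align}

where the inequality holds because $\mathrm{KL}\!\left(p(z \mid s) \,\|\, q_{\phi}(z \mid s)\right) \ge 0$. Maximizing the right-hand side directly with respect to the policy and $q_{\phi}$, rather than feeding an estimate of it back into the return as an unsupervised reward, is the distinction the abstract draws from prior mutual-information-based diversity methods.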