Discovering diverse solutions in deep reinforcement learning by maximizing state-action-based mutual information.

Author information

Osa Takayuki, Tangkaratt Voot, Sugiyama Masashi

Affiliations

Kyushu Institute of Technology, 2-4 Hibikino, Wakamatsu, Kita-kyushu, 808-0135, Fukuoka, Japan; RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, 103-0027, Tokyo, Japan.

RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, 103-0027, Tokyo, Japan.

Publication information

Neural Netw. 2022 Aug;152:90-104. doi: 10.1016/j.neunet.2022.04.009. Epub 2022 Apr 16.

DOI: 10.1016/j.neunet.2022.04.009
PMID: 35523085
Abstract

Reinforcement learning algorithms are typically limited to learning a single solution for a specified task, even though diverse solutions often exist. Recent studies showed that learning a set of diverse solutions is beneficial because diversity enables robust few-shot adaptation. Although existing methods learn diverse solutions by using the mutual information as unsupervised rewards, such an approach often suffers from the bias of the gradient estimator induced by value function approximation. In this study, we propose a novel method that can learn diverse solutions without suffering the bias problem. In our method, a policy conditioned on a continuous or discrete latent variable is trained by directly maximizing the variational lower bound of the mutual information, instead of using the mutual information as unsupervised rewards as in previous studies. Through extensive experiments on robot locomotion tasks, we demonstrate that the proposed method successfully learns an infinite set of diverse solutions by learning continuous latent variables, which is more challenging than learning a finite number of solutions. Subsequently, we show that our method enables more effective few-shot adaptation compared with existing methods.
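The objective described in the abstract can be illustrated for the discrete-latent case: with a uniform prior p(z) over K latent variables and a learned discriminator q(z | s, a), the variational lower bound is I(Z; S, A) >= H(Z) + E[log q(z | s, a)]. Below is a minimal NumPy sketch of this bound estimate; the discriminator logits and latent samples are toy stand-ins for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def variational_mi_lower_bound(logits, z, n_latents):
    """Monte Carlo estimate of the variational lower bound on I(Z; S, A):
    I(Z; S, A) >= H(Z) + E[log q(z | s, a)],
    with a uniform prior over Z and discriminator q given by softmax(logits)."""
    # log q(z | s, a): log-softmax of the discriminator logits
    log_q = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # pick out log q at the latent actually used on each (s, a) sample
    log_q_z = log_q[np.arange(len(z)), z]
    entropy_prior = np.log(n_latents)  # H(Z) for a uniform prior
    return entropy_prior + log_q_z.mean()

# Toy rollout: 4 latent "skills", discriminator logits for 256 (s, a) samples.
n_latents, n_samples = 4, 256
z = rng.integers(0, n_latents, size=n_samples)

# A discriminator that identifies the latent from (s, a) drives the bound
# toward H(Z) = log(4); an uninformed one yields a bound of zero.
good_logits = np.eye(n_latents)[z] * 10.0
bad_logits = np.zeros((n_samples, n_latents))

print(variational_mi_lower_bound(good_logits, z, n_latents))  # near log(4) ≈ 1.386
print(variational_mi_lower_bound(bad_logits, z, n_latents))   # near 0
```

The paper's contribution is to optimize this bound directly with respect to the latent-conditioned policy, rather than feeding log q(z | s, a) back as an unsupervised reward, which avoids the gradient bias introduced by value-function approximation.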


Similar articles

1. Visual Pretraining via Contrastive Predictive Model for Pixel-Based Reinforcement Learning.
Sensors (Basel). 2022 Aug 29;22(17):6504. doi: 10.3390/s22176504.
2. Context-Based Meta-Reinforcement Learning With Bayesian Nonparametric Models.
IEEE Trans Pattern Anal Mach Intell. 2024 Oct;46(10):6948-6965. doi: 10.1109/TPAMI.2024.3386780. Epub 2024 Sep 5.
3. Multimodal information bottleneck for deep reinforcement learning with multiple sensors.
Neural Netw. 2024 Aug;176:106347. doi: 10.1016/j.neunet.2024.106347. Epub 2024 Apr 27.
4. Quality-diversity based semi-autonomous teleoperation using reinforcement learning.
Neural Netw. 2024 Nov;179:106543. doi: 10.1016/j.neunet.2024.106543. Epub 2024 Jul 22.
5. LJIR: Learning Joint-Action Intrinsic Reward in cooperative multi-agent reinforcement learning.
Neural Netw. 2023 Oct;167:450-459. doi: 10.1016/j.neunet.2023.08.016. Epub 2023 Aug 22.
6. Modular deep reinforcement learning from reward and punishment for robot navigation.
Neural Netw. 2021 Mar;135:115-126. doi: 10.1016/j.neunet.2020.12.001. Epub 2020 Dec 8.
7. MuDE: Multi-agent decomposed reward-based exploration.
Neural Netw. 2024 Nov;179:106565. doi: 10.1016/j.neunet.2024.106565. Epub 2024 Jul 22.
8. Action-driven contrastive representation for reinforcement learning.
PLoS One. 2022 Mar 18;17(3):e0265456. doi: 10.1371/journal.pone.0265456. eCollection 2022.
9. Sequential action-induced invariant representation for reinforcement learning.
Neural Netw. 2024 Nov;179:106579. doi: 10.1016/j.neunet.2024.106579. Epub 2024 Jul 26.

Cited by

1. Advancing e-commerce user purchase prediction: Integration of time-series attention with event-based timestamp encoding and Graph Neural Network-Enhanced user profiling.
PLoS One. 2024 Apr 18;19(4):e0299087. doi: 10.1371/journal.pone.0299087. eCollection 2024.