

An information-theoretic approach to curiosity-driven reinforcement learning.

Authors

Susanne Still, Doina Precup

Affiliation

Information and Computer Sciences, University of Hawaii at Mānoa, Honolulu, HI 96822, USA.

Publication

Theory Biosci. 2012 Sep;131(3):139-48. doi: 10.1007/s12064-011-0142-z. Epub 2012 Jul 12.

DOI: 10.1007/s12064-011-0142-z
PMID: 22791268
Abstract

We provide a fresh look at the problem of exploration in reinforcement learning, drawing on ideas from information theory. First, we show that Boltzmann-style exploration, one of the main exploration methods used in reinforcement learning, is optimal from an information-theoretic point of view, in that it optimally trades expected return for the coding cost of the policy. Second, we address the problem of curiosity-driven learning. We propose that, in addition to maximizing the expected return, a learner should choose a policy that also maximizes the learner's predictive power. This makes the world both interesting and exploitable. Optimal policies then have the form of Boltzmann-style exploration with a bonus, containing a novel exploration-exploitation trade-off which emerges naturally from the proposed optimization principle. Importantly, this exploration-exploitation trade-off persists in the optimal deterministic policy, i.e., when there is no exploration due to randomness. As a result, exploration is understood as an emerging behavior that optimizes information gain, rather than being modeled as pure randomization of action choices.
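The Boltzmann-style policy with an exploration bonus described above can be sketched numerically. The following is a minimal illustration, not the paper's actual method: the Q-values and bonus values are made up, and the paper's predictive-power term is represented here only as a generic additive bonus inside the softmax.

```python
import numpy as np

def boltzmann_policy(q, beta=1.0, bonus=None):
    """Softmax policy: pi(a) proportional to exp(beta * (Q(a) + bonus(a))).

    beta trades expected return against the coding cost of the policy:
    beta -> 0 yields the uniform (maximally exploratory) policy,
    beta -> inf yields the greedy deterministic policy.
    """
    x = np.asarray(q, dtype=float)
    if bonus is not None:
        # Hypothetical additive bonus, standing in for an information-gain term.
        x = x + np.asarray(bonus, dtype=float)
    z = beta * (x - x.max())  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Made-up Q-values for three actions in some state.
q = [1.0, 2.0, 0.5]
print(boltzmann_policy(q, beta=0.0))                        # uniform: pure exploration
print(boltzmann_policy(q, beta=5.0))                        # nearly greedy on action 1
print(boltzmann_policy(q, beta=2.0, bonus=[0.0, 0.0, 1.5])) # bonus shifts mass to action 2
```

Varying `beta` traces out the return-versus-coding-cost trade-off; adding the bonus shows how an extra term in the exponent shifts probability mass toward actions that would otherwise be neglected.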


Similar Articles

1. An information-theoretic approach to curiosity-driven reinforcement learning.
   Theory Biosci. 2012 Sep;131(3):139-48. doi: 10.1007/s12064-011-0142-z. Epub 2012 Jul 12.
2. Curiosity-driven recommendation strategy for adaptive learning via deep reinforcement learning.
   Br J Math Stat Psychol. 2020 Nov;73(3):522-540. doi: 10.1111/bmsp.12199. Epub 2020 Feb 21.
3. Contributions of expected learning progress and perceptual novelty to curiosity-driven exploration.
   Cognition. 2022 Aug;225:105119. doi: 10.1016/j.cognition.2022.105119. Epub 2022 Apr 12.
4. Computational mechanisms of curiosity and goal-directed exploration.
   Elife. 2019 May 10;8:e41703. doi: 10.7554/eLife.41703.
5. Hierarchical curiosity loops and active sensing.
   Neural Netw. 2012 Aug;32:119-29. doi: 10.1016/j.neunet.2012.02.024. Epub 2012 Feb 14.
6. Curiosity and the dynamics of optimal exploration.
   Trends Cogn Sci. 2024 May;28(5):441-453. doi: 10.1016/j.tics.2024.02.001. Epub 2024 Feb 26.
7. LJIR: Learning Joint-Action Intrinsic Reward in cooperative multi-agent reinforcement learning.
   Neural Netw. 2023 Oct;167:450-459. doi: 10.1016/j.neunet.2023.08.016. Epub 2023 Aug 22.
8. Human Variability and the Explore-Exploit Trade-Off in Recommendation.
   Cogn Sci. 2023 Apr;47(4):e13279. doi: 10.1111/cogs.13279.
9. Protection from uncertainty in the exploration/exploitation trade-off.
   J Exp Psychol Learn Mem Cogn. 2022 Apr;48(4):547-568. doi: 10.1037/xlm0000883. Epub 2021 Jun 10.
10. Pupil diameter predicts changes in the exploration-exploitation trade-off: evidence for the adaptive gain theory.
    J Cogn Neurosci. 2011 Jul;23(7):1587-96. doi: 10.1162/jocn.2010.21548. Epub 2010 Jul 28.

Cited By

1. From pixels to planning: scale-free active inference.
   Front Netw Physiol. 2025 Jun 18;5:1521963. doi: 10.3389/fnetp.2025.1521963. eCollection 2025.
2. Towards Human-Like Emergent Communication via Utility, Informativeness, and Complexity.
   Open Mind (Camb). 2025 Apr 2;9:418-451. doi: 10.1162/opmi_a_00188. eCollection 2025.
3. Complex behavior from intrinsic motivation to occupy future action-state path space.
   Nat Commun. 2024 Jul 29;15(1):6368. doi: 10.1038/s41467-024-49711-1.
4. The Reward-Complexity Trade-off in Schizophrenia.
   Comput Psychiatr. 2021 May 25;5(1):38-53. doi: 10.5334/cpsy.71. eCollection 2021.
5. Human decision making balances reward maximization and policy compression.
   PLoS Comput Biol. 2024 Apr 26;20(4):e1012057. doi: 10.1371/journal.pcbi.1012057. eCollection 2024 Apr.
6. Bayesian Reinforcement Learning With Limited Cognitive Load.
   Open Mind (Camb). 2024 Apr 3;8:395-438. doi: 10.1162/opmi_a_00132. eCollection 2024.
7. Federated inference and belief sharing.
   Neurosci Biobehav Rev. 2024 Jan;156:105500. doi: 10.1016/j.neubiorev.2023.105500. Epub 2023 Dec 5.
8. Bibliometric Analysis of Information Theoretic Studies.
   Entropy (Basel). 2022 Sep 25;24(10):1359. doi: 10.3390/e24101359.
9. Rethinking statistical learning as a continuous dynamic stochastic process, from the motor systems perspective.
   Front Neurosci. 2022 Nov 8;16:1033776. doi: 10.3389/fnins.2022.1033776. eCollection 2022.
10. Predictive maps in rats and humans for spatial navigation.
    Curr Biol. 2022 Sep 12;32(17):3676-3689.e5. doi: 10.1016/j.cub.2022.06.090. Epub 2022 Jul 20.

References

1. Efficient computation of optimal actions.
   Proc Natl Acad Sci U S A. 2009 Jul 14;106(28):11478-83. doi: 10.1073/pnas.0710743106. Epub 2009 Jul 2.
2. Reinforcement learning of motor skills with policy gradients.
   Neural Netw. 2008 May;21(4):682-97. doi: 10.1016/j.neunet.2008.02.003. Epub 2008 Apr 26.
3. How many clusters? An information-theoretic perspective.
   Neural Comput. 2004 Dec;16(12):2483-506. doi: 10.1162/0899766042321751.
4. Regularities unseen, randomness observed: levels of entropy convergence.
   Chaos. 2003 Mar;13(1):25-54. doi: 10.1063/1.1530990.
5. Predictability, complexity, and learning.
   Neural Comput. 2001 Nov;13(11):2409-63. doi: 10.1162/089976601753195969.
6. Statistical mechanics and phase transitions in clustering.
   Phys Rev Lett. 1990 Aug 20;65(8):945-948. doi: 10.1103/PhysRevLett.65.945.