


Posterior weighted reinforcement learning with state uncertainty.

Author Information

Department of Computer Science, University of Bristol, Bristol, UK.

Publication Information

Neural Comput. 2010 May;22(5):1149-79. doi: 10.1162/neco.2010.01-09-948.

DOI: 10.1162/neco.2010.01-09-948
PMID: 20100078
Abstract

Reinforcement learning models generally assume that a stimulus is presented that allows a learner to unambiguously identify the state of nature, and the reward received is drawn from a distribution that depends on that state. However, in any natural environment, the stimulus is noisy. When there is state uncertainty, it is no longer immediately obvious how to perform reinforcement learning, since the observed reward cannot be unambiguously allocated to a state of the environment. This letter addresses the problem of incorporating state uncertainty in reinforcement learning models. We show that simply ignoring the uncertainty and allocating the reward to the most likely state of the environment results in incorrect value estimates. Furthermore, using only the information that is available before observing the reward also results in incorrect estimates. We therefore introduce a new technique, posterior weighted reinforcement learning, in which the estimates of state probabilities are updated according to the observed rewards (e.g., if a learner observes a reward usually associated with a particular state, this state becomes more likely). We show analytically that this modified algorithm can converge to correct reward estimates and confirm this with numerical experiments. The algorithm is shown to be a variant of the expectation-maximization algorithm, allowing rigorous convergence analyses to be carried out. A possible neural implementation of the algorithm in the cortico-basal-ganglia-thalamic network is presented, and experimental predictions of our model are discussed.
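The abstract's core claim can be illustrated with a minimal numerical sketch (not the paper's implementation; all names and parameters here are illustrative): two hidden states with Gaussian state-dependent rewards, a noisy stimulus that is 70% reliable. Crediting the whole reward to the most likely state biases the value estimates, while re-weighting the stimulus-based prior by the reward likelihood (the posterior) recovers the true means.

```python
import math
import random

def gauss_pdf(x, mu, sigma):
    """Gaussian density, used as the reward likelihood."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def run(n_trials=40000, alpha=0.02, seed=1):
    rng = random.Random(seed)
    mu = [1.0, 3.0]      # true state-dependent mean rewards (illustrative)
    sigma = 0.5          # reward noise
    q_pw = [0.0, 0.0]    # posterior-weighted estimates
    q_ml = [0.0, 0.0]    # naive: credit the most likely state only
    for _ in range(n_trials):
        s = rng.randrange(2)                      # true hidden state
        o = s if rng.random() < 0.7 else 1 - s    # noisy stimulus, 70% reliable
        prior = [0.7, 0.3] if o == 0 else [0.3, 0.7]
        r = rng.gauss(mu[s], sigma)               # reward drawn from the true state
        # Naive allocation: give the whole reward to the most likely state.
        q_ml[o] += alpha * (r - q_ml[o])
        # Posterior weighting: re-weight the prior by the reward likelihood
        # under the current estimates, then update every state softly.
        w = [prior[i] * gauss_pdf(r, q_pw[i], sigma) for i in range(2)]
        z = w[0] + w[1]
        post = [w[i] / z for i in range(2)]
        for i in range(2):
            q_pw[i] += alpha * post[i] * (r - q_pw[i])
    return q_pw, q_ml
```

With these numbers the naive estimates settle near the stimulus-conditional means (about 1.6 and 2.4) rather than the true 1.0 and 3.0, while the posterior-weighted estimates converge close to the true means. The soft reallocation over states is a stochastic E-step, which is why the algorithm can be analyzed as a variant of expectation-maximization.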


Similar Articles

1. Posterior weighted reinforcement learning with state uncertainty.
Neural Comput. 2010 May;22(5):1149-79. doi: 10.1162/neco.2010.01-09-948.
2. Model-based reinforcement learning under concurrent schedules of reinforcement in rodents.
Learn Mem. 2009 Apr 29;16(5):315-23. doi: 10.1101/lm.1295509. Print 2009 May.
3. The "proactive" model of learning: Integrative framework for model-free and model-based reinforcement learning utilizing the associative learning-based proactive brain concept.
Behav Neurosci. 2016 Feb;130(1):6-18. doi: 10.1037/bne0000116.
4. Heterarchical reinforcement-learning model for integration of multiple cortico-striatal loops: fMRI examination in stimulus-action-reward association learning.
Neural Netw. 2006 Oct;19(8):1242-54. doi: 10.1016/j.neunet.2006.06.007. Epub 2006 Sep 20.
5. Online learning of shaping rewards in reinforcement learning.
Neural Netw. 2010 May;23(4):541-50. doi: 10.1016/j.neunet.2010.01.001. Epub 2010 Jan 11.
6. Simulation of rat behavior by a reinforcement learning algorithm in consideration of appearance probabilities of reinforcement signals.
Biosystems. 2005 Apr;80(1):83-90. doi: 10.1016/j.biosystems.2004.10.005. Epub 2004 Dec 8.
7. Mechanisms of reinforcement learning and decision making in the primate dorsolateral prefrontal cortex.
Ann N Y Acad Sci. 2007 May;1104:108-22. doi: 10.1196/annals.1390.007. Epub 2007 Mar 8.
8. Dynamical model of salience gated working memory, action selection and reinforcement based on basal ganglia and dopamine feedback.
Neural Netw. 2008 Mar-Apr;21(2-3):322-30. doi: 10.1016/j.neunet.2007.12.040. Epub 2007 Dec 31.
9. Adaptive properties of differential learning rates for positive and negative outcomes.
Biol Cybern. 2013 Dec;107(6):711-9. doi: 10.1007/s00422-013-0571-5. Epub 2013 Oct 2.
10. A spiking neural model for stable reinforcement of synapses based on multiple distal rewards.
Neural Comput. 2013 Jan;25(1):123-56. doi: 10.1162/NECO_a_00387. Epub 2012 Sep 28.

Cited By

1. Inferring neural activity before plasticity as a foundation for learning beyond backpropagation.
Nat Neurosci. 2024 Feb;27(2):348-358. doi: 10.1038/s41593-023-01514-1. Epub 2024 Jan 3.
2. The role of state uncertainty in the dynamics of dopamine.
Curr Biol. 2022 Mar 14;32(5):1077-1087.e9. doi: 10.1016/j.cub.2022.01.025. Epub 2022 Feb 2.
3. A probabilistic, distributed, recursive mechanism for decision-making in the brain.
PLoS Comput Biol. 2018 Apr 3;14(4):e1006033. doi: 10.1371/journal.pcbi.1006033. eCollection 2018 Apr.
4. Learning Reward Uncertainty in the Basal Ganglia.
PLoS Comput Biol. 2016 Sep 2;12(9):e1005062. doi: 10.1371/journal.pcbi.1005062. eCollection 2016 Sep.
5. Multiplicity of control in the basal ganglia: computational roles of striatal subregions.
Curr Opin Neurobiol. 2011 Jun;21(3):374-80. doi: 10.1016/j.conb.2011.02.009. Epub 2011 Mar 21.