Department of Computer Science, University of Bristol, Bristol, UK.
Neural Comput. 2010 May;22(5):1149-79. doi: 10.1162/neco.2010.01-09-948.
Reinforcement learning models generally assume that a stimulus is presented that allows a learner to unambiguously identify the state of nature, and the reward received is drawn from a distribution that depends on that state. However, in any natural environment, the stimulus is noisy. When there is state uncertainty, it is no longer immediately obvious how to perform reinforcement learning, since the observed reward cannot be unambiguously allocated to a state of the environment. This letter addresses the problem of incorporating state uncertainty in reinforcement learning models. We show that simply ignoring the uncertainty and allocating the reward to the most likely state of the environment results in incorrect value estimates. Furthermore, using only the information that is available before observing the reward also results in incorrect estimates. We therefore introduce a new technique, posterior weighted reinforcement learning, in which the estimates of state probabilities are updated according to the observed rewards (e.g., if a learner observes a reward usually associated with a particular state, this state becomes more likely). We show analytically that this modified algorithm can converge to correct reward estimates and confirm this with numerical experiments. The algorithm is shown to be a variant of the expectation-maximization algorithm, allowing rigorous convergence analyses to be carried out. A possible neural implementation of the algorithm in the cortico-basal-ganglia-thalamic network is presented, and experimental predictions of our model are discussed.
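The central idea of the abstract can be illustrated with a short sketch. Below is a minimal Python example of a posterior-weighted delta-rule update, assuming a Gaussian reward model for each hidden state; the function name, the learning rate `alpha`, and the noise scale `sigma` are illustrative assumptions, not the paper's notation or implementation.

```python
import numpy as np

def posterior_weighted_update(values, p_state, reward, sigma=1.0, alpha=0.1):
    """One posterior-weighted update of per-state reward estimates.

    values  : current reward estimates, one per hidden state
    p_state : prior state probabilities inferred from the noisy stimulus
    reward  : reward observed on this trial
    sigma   : assumed reward noise scale (illustrative)
    alpha   : learning rate (illustrative)
    """
    # Likelihood of the observed reward under each state's current estimate
    # (Gaussian reward model assumed here for illustration).
    lik = np.exp(-0.5 * ((reward - values) / sigma) ** 2)
    # Posterior over states: combine the stimulus-based prior with the
    # reward likelihood, so the observed reward reshapes the state belief.
    post = p_state * lik
    post /= post.sum()
    # Delta-rule update in which each state's effective learning rate is
    # scaled by its posterior probability, rather than by the prior alone
    # or by a hard assignment to the most likely state.
    return values + alpha * post * (reward - values)

# Toy simulation: two hidden states with mean rewards 1 and 5, and a noisy
# cue that favors the true state with probability 0.7.
rng = np.random.default_rng(0)
true_means = np.array([1.0, 5.0])
values = np.zeros(2)
for _ in range(2000):
    s = rng.integers(2)  # true state, unobserved by the learner
    p_state = np.array([0.7, 0.3]) if s == 0 else np.array([0.3, 0.7])
    r = rng.normal(true_means[s], 1.0)
    values = posterior_weighted_update(values, p_state, r)
print(values)  # estimates should approach [1.0, 5.0]
```

Under the Gaussian assumption this mirrors the E-step/M-step structure the abstract alludes to: the posterior computation plays the role of the E-step and the weighted delta rule the role of an incremental M-step.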