Neural Computation Unit, Okinawa Institute of Science and Technology, 1919-1 Tancha, Onna-son, Kunigami, Okinawa 904-0412, Japan.
Eur J Neurosci. 2012 Apr;35(7):1180-9. doi: 10.1111/j.1460-9568.2012.08025.x.
The estimation of reward outcomes for action candidates is essential for decision making. In this study, we examined whether and how uncertainty in reward-outcome estimation affects action choice and the learning rate. We designed a choice task in which rats selected either the left-poking or right-poking hole and stochastically received a food-pellet reward. The reward probabilities of the left and right holes were chosen from six settings (high, 100% vs. 66%; mid, 66% vs. 33%; low, 33% vs. 0% for the left vs. right holes, and the opposites) every 20-549 trials. We used Bayesian Q-learning models to estimate the time course of the probability distribution of action values and tested whether they explain the rats' behavior better than standard Q-learning models, which estimate only the mean of the action values. Model comparison by cross-validation revealed that a Bayesian Q-learning model with an asymmetric update for reward and non-reward outcomes best fit the rats' choice time course. In the action-choice equation of the Bayesian Q-learning model, the estimated coefficient for the variance of the action value was positive, meaning that the rats were uncertainty seeking. Further analysis of the Bayesian Q-learning model suggested that uncertainty increased the effective learning rate. These results suggest that rats take uncertainty in action-value estimation into account, and that they have both an uncertainty-seeking action policy and an uncertainty-dependent modulation of the effective learning rate.
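The three ingredients of the model described above (a probability distribution over each action value, a positive variance coefficient in the choice rule, and an effective learning rate that grows with uncertainty) can be illustrated with a minimal Kalman-filter-style sketch. This is an illustrative reconstruction under assumed Gaussian value distributions, not the paper's exact model; all parameter names and values (`obs_noise_reward`, `obs_noise_noreward`, `process_noise`, `beta`, `phi`) are hypothetical. The asymmetric reward/non-reward update is modeled here as separate observation-noise terms, which is one possible parameterization.

```python
import math
import random


class BayesianQAgent:
    """Illustrative Bayesian Q-learning sketch (not the paper's exact model).

    Each action value is tracked as a Gaussian N(mu, var) and updated with a
    Kalman-filter step, so the Kalman gain plays the role of an effective
    learning rate that grows with uncertainty.
    """

    def __init__(self, n_actions=2, obs_noise_reward=0.3, obs_noise_noreward=0.6,
                 process_noise=0.01, beta=5.0, phi=2.0):
        self.mu = [0.5] * n_actions    # mean action values
        self.var = [0.25] * n_actions  # variances (uncertainty)
        # Asymmetric update: separate observation noise for reward (1)
        # vs. non-reward (0) outcomes -- a hypothetical parameterization.
        self.obs_noise = {1: obs_noise_reward, 0: obs_noise_noreward}
        self.process_noise = process_noise
        self.beta = beta  # inverse temperature of the softmax
        self.phi = phi    # variance coefficient; phi > 0 => uncertainty seeking

    def choose(self):
        # Softmax over the mean value plus a variance bonus, so actions with
        # uncertain values are chosen more often when phi is positive.
        scores = [self.beta * (m + self.phi * v)
                  for m, v in zip(self.mu, self.var)]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        z = sum(exps)
        r, acc = random.random(), 0.0
        for a, e in enumerate(exps):
            acc += e / z
            if r < acc:
                return a
        return len(exps) - 1

    def update(self, action, reward):
        # Kalman gain = effective learning rate; it is large when the
        # variance is large, so uncertainty speeds up learning.
        k = self.var[action] / (self.var[action] + self.obs_noise[reward])
        self.mu[action] += k * (reward - self.mu[action])
        self.var[action] *= (1.0 - k)
        # Diffusion keeps uncertainty from collapsing to zero, letting the
        # agent track the block-wise changes in reward probability.
        for a in range(len(self.var)):
            self.var[a] += self.process_noise
        return k
```

In this sketch, repeated observations of the same action shrink its variance, so the returned gain `k` (the effective learning rate) decreases across consecutive updates, while the diffusion term restores some uncertainty on every trial.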