Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, United States of America.
PLoS Comput Biol. 2010 Dec 2;6(12):e1001003. doi: 10.1371/journal.pcbi.1001003.
Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that has perfect knowledge of the model of how rewards and events are generated in the environment. We argue that, rather than being suboptimal, human behavior reflects a more complex learning problem, one that also involves learning the structure of reward generation in the environment. We formulate the problem of structure learning in sequential decision tasks using Bayesian reinforcement learning, and show that learning the generative model for rewards qualitatively changes the behavior of an optimal learning agent. To test whether people exhibit structure learning, we performed experiments involving a mixture of one-armed and two-armed bandit reward models, in which structure learning produces many of the qualitative behaviors deemed suboptimal in previous studies. Our results demonstrate that humans can perform structure learning in a near-optimal manner.
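As a concrete illustration of what structure learning over bandit reward models involves, the sketch below is a minimal, assumption-laden Python example, not the authors' model. It maintains a posterior over two candidate reward structures for a two-arm task: an "independent" structure in which each arm has its own Bernoulli payoff rate, and a "coupled" structure in which a single shared rate drives both arms in an anticorrelated way. The Beta(1,1) priors, the uniform prior over structures, the specific coupled parameterization, and the epsilon-greedy action rule are all illustrative assumptions.

```python
# Minimal sketch of Bayesian structure learning for a two-arm bandit.
# NOTE: an illustrative toy, not the model used in the paper. Assumptions:
# Beta(1,1) priors on payoff rates, a uniform prior over the two candidate
# structures, an anticorrelated "coupled" structure, and epsilon-greedy
# action selection.
import numpy as np
from scipy.special import betaln

def log_ml_independent(counts):
    # Marginal likelihood when each arm i has its own rate theta_i ~ Beta(1,1):
    # product over arms of B(s_i + 1, f_i + 1) / B(1, 1).
    return sum(betaln(s + 1.0, f + 1.0) - betaln(1.0, 1.0) for s, f in counts)

def log_ml_coupled(counts):
    # Marginal likelihood when one shared rate p drives both arms:
    # arm 1 pays with probability p, arm 0 with probability 1 - p.
    s = counts[1][0] + counts[0][1]  # outcomes occurring with probability p
    f = counts[1][1] + counts[0][0]  # outcomes occurring with probability 1 - p
    return betaln(s + 1.0, f + 1.0) - betaln(1.0, 1.0)

def structure_posterior(counts):
    # With a uniform structure prior, the posterior is the normalized
    # vector of marginal likelihoods: [P(independent | data), P(coupled | data)].
    logs = np.array([log_ml_independent(counts), log_ml_coupled(counts)])
    return np.exp(logs - np.logaddexp(logs[0], logs[1]))

def expected_rewards(counts):
    # Posterior-mean payoff of each arm, marginalized over structures.
    post = structure_posterior(counts)
    ind = np.array([(s + 1.0) / (s + f + 2.0) for s, f in counts])
    s = counts[1][0] + counts[0][1]
    f = counts[1][1] + counts[0][0]
    p = (s + 1.0) / (s + f + 2.0)
    return post[0] * ind + post[1] * np.array([1.0 - p, p])

rng = np.random.default_rng(0)
true_p = 0.7                    # environment is truly coupled, with rate 0.7
counts = [[0, 0], [0, 0]]       # [successes, failures] for arms 0 and 1
for t in range(500):
    q = expected_rewards(counts)
    arm = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(q))
    reward = int(rng.random() < (true_p if arm == 1 else 1.0 - true_p))
    counts[arm][0] += reward
    counts[arm][1] += 1 - reward

print("P(independent), P(coupled):", structure_posterior(counts))
```

The point of the toy is that in a coupled world, outcomes on either arm are evidence about both arms, so a structure-learning agent can treat losses on one arm as reason to switch; judged against an independent-arms ideal actor, that same behavior would look suboptimal.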