Samuel J. Gershman, Bijan Pesaran, Nathaniel D. Daw
Center for Neural Science, New York University, New York, New York 10003, USA.
J Neurosci. 2009 Oct 28;29(43):13524-31. doi: 10.1523/JNEUROSCI.2469-09.2009.
Humans and animals are endowed with a large number of effectors. Although this enables great behavioral flexibility, it presents an equally formidable reinforcement learning problem of discovering which actions are most valuable, because of the high dimensionality of the action space. An unresolved question is how neural systems for reinforcement learning, such as prediction error signals for action valuation associated with dopamine and the striatum, can cope with this "curse of dimensionality." We propose a reinforcement learning framework that allows learned action valuations to be decomposed into effector-specific components when appropriate to a task, and test it by studying to what extent human behavior and blood oxygen level-dependent (BOLD) activity can exploit such a decomposition in a multieffector choice task. Subjects made simultaneous decisions with their left and right hands and received separate reward feedback for each hand movement. We found that choice behavior was better described by a learning model that decomposed the values of bimanual movements into separate values for each effector than by a traditional model that treated the bimanual actions as unitary with a single value. A decomposition of value into effector-specific components was also observed in value-related BOLD signaling, in the form of lateralized biases in striatal correlates of prediction error and anticipatory value correlates in the intraparietal sulcus. These results suggest that the human brain can use decomposed value representations to "divide and conquer" reinforcement learning over high-dimensional action spaces.
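The contrast between the two learning models can be illustrated with a minimal delta-rule sketch. This is not the paper's implementation; the function names, the two-option-per-hand setup, and the learning rate are illustrative assumptions. The point is the scaling: a unitary learner must learn one value per joint bimanual action (multiplicative in the number of effectors), whereas the decomposed learner updates one value table per hand from that hand's own reward feedback.

```python
def unitary_update(Q, joint_action, reward, alpha=0.1):
    """Unitary learner (illustrative): one value per joint bimanual
    action, so with n options per hand there are n * n entries to learn.
    A single prediction error updates the value of the whole pair."""
    Q[joint_action] += alpha * (reward - Q[joint_action])
    return Q

def factored_update(Q_left, Q_right, a_left, a_right, r_left, r_right, alpha=0.1):
    """Decomposed learner (illustrative): separate value tables and
    separate prediction errors for each hand, each updated only from
    that hand's own reward feedback -- 2 * n entries instead of n * n."""
    Q_left[a_left] += alpha * (r_left - Q_left[a_left])
    Q_right[a_right] += alpha * (r_right - Q_right[a_right])
    return Q_left, Q_right
```

Under the decomposed scheme, a reward delivered to the left hand leaves the right hand's values untouched, which is the behavioral signature the study tests for.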