Liakoni Vasiliki, Lehmann Marco P, Modirshanechi Alireza, Brea Johanni, Lutti Antoine, Gerstner Wulfram, Preuschoff Kerstin
École Polytechnique Fédérale de Lausanne (EPFL), School of Computer and Communication Sciences and School of Life Sciences, Lausanne, Switzerland.
Neuroimage. 2022 Feb 1;246:118780. doi: 10.1016/j.neuroimage.2021.118780. Epub 2021 Dec 5.
Learning how to reach a reward over long series of actions is a remarkable capability of humans, and potentially guided by multiple parallel learning modules. Current brain imaging of learning modules is limited by (i) simple experimental paradigms, (ii) entanglement of brain signals of different learning modules, and (iii) a limited number of computational models considered as candidates for explaining behavior. Here, we address these three limitations and (i) introduce a complex sequential decision making task with surprising events that allows us to (ii) dissociate correlates of reward prediction errors from those of surprise in functional magnetic resonance imaging (fMRI); and (iii) we test behavior against a large repertoire of model-free, model-based, and hybrid reinforcement learning algorithms, including a novel surprise-modulated actor-critic algorithm. Surprise, derived from an approximate Bayesian approach for learning the world-model, is extracted in our algorithm from a state prediction error. Surprise is then used to modulate the learning rate of a model-free actor, which itself learns via the reward prediction error from model-free value estimation by the critic. We find that action choices are well explained by pure model-free policy gradient, but reaction times and neural data are not. We identify signatures of both model-free and surprise-based learning signals in blood oxygen level dependent (BOLD) responses, supporting the existence of multiple parallel learning modules in the brain. Our results extend previous fMRI findings to a multi-step setting and emphasize the role of policy gradient and surprise signalling in human learning.
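The surprise-modulated actor-critic described above can be illustrated with a minimal sketch: a tabular world-model supplies a state prediction error (surprise), which scales the learning rate of a policy-gradient actor that is itself trained on the critic's reward prediction error. All names, update rules, and constants below are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np

# Illustrative sketch only: a tiny tabular MDP with a surprise-modulated
# actor-critic. The specific functional forms (e.g. -log p as surprise,
# (1 + surprise) as the modulation factor) are assumptions for exposition.

n_states, n_actions = 5, 2

theta = np.zeros((n_states, n_actions))             # actor: policy parameters
V = np.zeros(n_states)                              # critic: state values
counts = np.ones((n_states, n_actions, n_states))   # world-model: transition counts

alpha_critic, base_alpha_actor, gamma = 0.1, 0.1, 0.95

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step_update(s, a, r, s_next):
    """One learning step: surprise modulates the actor's learning rate."""
    # Surprise as a state prediction error from the approximate world-model:
    # large when the observed next state was assigned low probability.
    p_next = counts[s, a] / counts[s, a].sum()
    surprise = -np.log(p_next[s_next])
    counts[s, a, s_next] += 1                       # update the world-model

    # Critic: reward prediction error (TD error) from model-free values.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * delta

    # Actor: policy-gradient update with a surprise-scaled learning rate.
    alpha_actor = base_alpha_actor * (1.0 + surprise)
    pi = softmax(theta[s])
    grad = -pi
    grad[a] += 1.0                                  # d log pi(a|s) / d theta[s]
    theta[s] += alpha_actor * delta * grad
    return surprise, delta
```

In this toy version, an unexpected transition (high surprise) makes the actor update its policy more strongly on that step, while the critic's value update proceeds at a fixed rate, keeping the two learning signals dissociable.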