Khamassi Mehdi, Girard Benoît
Institute of Intelligent Systems and Robotics (ISIR), Sorbonne Université and CNRS (Centre National de la Recherche Scientifique), 75005, Paris, France.
Biol Cybern. 2020 Apr;114(2):231-248. doi: 10.1007/s00422-020-00817-x. Epub 2020 Feb 17.
Hippocampal offline reactivations during reward-based learning, usually categorized as replay events, have been found to be important for performance improvement over time and for memory consolidation. Recent computational work has linked these phenomena to the need to transform reward information into state-action values for decision making and to propagate this information to all relevant states of the environment. Nevertheless, it is still unclear whether an integrated reinforcement learning mechanism could account for the variety of awake hippocampal reactivations, including variety in order (forward and reverse reactivated trajectories) and variety in the location where they occur (reward site or decision point). Here, we present a model-based bidirectional search model which accounts for a variety of hippocampal reactivations. The model combines forward trajectory sampling from the current position and backward sampling through prioritized sweeping from states associated with large reward prediction errors, until the two trajectories connect. This is repeated until stabilization of state-action values (convergence), which could explain why hippocampal reactivations drastically diminish when the animal's performance stabilizes. Simulations in a multiple T-maze task show that forward reactivations are prominently found at decision points, while backward reactivations are exclusively generated at reward sites. Finally, the model can generate imaginary trajectories that the agent is not allowed to perform during the task. We raise experimental predictions and implications for future studies of the role of the hippocampo-prefronto-striatal network in learning.
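To make the mechanism described in the abstract concrete, below is a minimal Python sketch (not the authors' implementation) of a bidirectional replay scheme: forward trajectory sampling from the agent's current state combined with backward prioritized sweeping from state-actions with large reward-prediction errors, repeated until the state-action values stabilize. The linear-track world, parameter values, and function names are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of bidirectional replay on a toy linear track (illustrative only).
import heapq

GAMMA, THETA = 0.9, 1e-3      # discount factor, priority / convergence threshold
N_STATES = 10                 # linear track: states 0..9, reward at the far end
ACTIONS = (-1, +1)            # step left / step right

def step(s, a):
    """Deterministic world model: linear track with an absorbing reward state."""
    if s == N_STATES - 1:
        return s, 0.0         # absorbing goal state: no further reward
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

# Backward model: which (state, action) pairs lead into each state.
preds = {s: [] for s in range(N_STATES)}
for s in range(N_STATES):
    for a in ACTIONS:
        preds[step(s, a)[0]].append((s, a))

def priority(s, a):
    """Absolute reward-prediction error of (s, a) under the current values."""
    s2, r = step(s, a)
    return abs(r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])

def backup(s, a):
    """Full model-based backup of Q(s, a)."""
    s2, r = step(s, a)
    Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)

def replay(current_state, max_sweeps=200):
    """One offline reactivation: forward + backward sweeps until values stabilize."""
    pq = []
    for (s, a) in Q:                      # seed queue with high-error state-actions
        p = priority(s, a)
        if p > THETA:
            heapq.heappush(pq, (-p, s, a))
    for _ in range(max_sweeps):
        # Forward sweep: short greedy trajectory reactivated from the current state.
        s = current_state
        for _ in range(3):
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
            backup(s, a)
            s = step(s, a)[0]
        # Backward sweep: propagate value from the highest-error state-action
        # to its predecessors (prioritized sweeping).
        if not pq:
            break                         # values have stabilized: reactivations cease
        _, s, a = heapq.heappop(pq)
        backup(s, a)
        for (sp, ap) in preds[s]:
            p = priority(sp, ap)
            if p > THETA:
                heapq.heappush(pq, (-p, sp, ap))

# Example: a reactivation event triggered at the start of the track (state 0);
# value propagates backward from the reward site along the whole track.
replay(current_state=0)
print([round(max(Q[(s, a)] for a in ACTIONS), 3) for s in range(N_STATES)])
```

In this toy version the backward component stops on its own once no prediction error exceeds the threshold, which is one simple way to capture the abstract's point that reactivations diminish when values (and hence performance) have stabilized.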