Department of Bioengineering, Imperial College London, London, United Kingdom.
Elife. 2023 Mar 16;12:e80671. doi: 10.7554/eLife.80671.
The hippocampus has been proposed to encode environments using a representation that contains predictive information about likely future states, called the successor representation. However, it is not clear how such a representation could be learned in the hippocampal circuit. Here, we propose a plasticity rule that can learn this predictive map of the environment using a spiking neural network. We connect this biologically plausible plasticity rule to reinforcement learning, showing mathematically and numerically that it implements the TD(λ) algorithm. By spanning these different levels, we show how our framework naturally encompasses behavioral activity and replays, moving smoothly from rate to temporal coding, and allows learning over behavioral timescales with a plasticity rule acting on a timescale of milliseconds. We discuss how biological parameters such as dwelling times at states, neuronal firing rates and neuromodulation relate to the delay discounting parameter of the TD algorithm, and how they influence the learned representation. We also find that, in agreement with psychological studies and contrary to standard reinforcement learning theory, the discount factor decreases hyperbolically with time. Finally, our framework suggests a role for replays, both in aiding learning in novel environments and in finding shortcut trajectories that were not experienced during behavior, in agreement with experimental data.
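For readers unfamiliar with the algorithm the abstract refers to, the following is a minimal sketch of learning a successor representation with tabular TD(λ). It is not the paper's spiking-network plasticity rule; the function name, parameter values, and the linear-track example are illustrative assumptions, and the matrix M[s, s'] simply estimates the expected discounted future occupancy of state s' given a start in state s.

```python
import numpy as np

def td_lambda_sr(episodes, n_states, gamma=0.95, lam=0.9, alpha=0.1):
    """Tabular TD(lambda) estimate of the successor representation M,
    where M[s, s'] approximates the expected discounted future occupancy
    of state s' when starting from state s.

    `episodes` is a list of state sequences (e.g. trajectories on a track)."""
    M = np.eye(n_states)                     # each state trivially predicts itself
    for states in episodes:
        e = np.zeros(n_states)               # eligibility trace over start states
        for s, s_next in zip(states[:-1], states[1:]):
            e *= gamma * lam                 # decay all traces
            e[s] += 1.0                      # mark the just-visited state as eligible
            # TD error on predicted occupancies: observed one-hot occupancy of s,
            # plus bootstrapped prediction from s_next, minus current prediction.
            delta = np.eye(n_states)[s] + gamma * M[s_next] - M[s]
            M += alpha * np.outer(e, delta)  # credit all eligible start states
    return M

# Usage sketch: repeated left-to-right runs on a 5-state linear track.
M = td_lambda_sr([list(range(5))] * 200, n_states=5)
print(np.round(M, 2))  # each row decays roughly as powers of gamma along the track
```

In this conventional formulation the discount per step is a fixed exponential factor gamma; the paper's point, by contrast, is that the effective discount emerging from biological parameters (dwell times, firing rates, neuromodulation) falls off hyperbolically rather than exponentially with time.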