Hoang Huu, Tsutsumi Shinichiro, Matsuzaki Masanori, Kano Masanobu, Toyama Keisuke, Kitamura Kazuo, Kawato Mitsuo
Neural Information Analysis Laboratories, Advanced Telecommunications Research Institute International, Kyoto, Japan.
Laboratory for Multi-scale Biological Psychiatry, RIKEN Center for Brain Science, Saitama, Japan.
PLoS Comput Biol. 2025 Mar 17;21(3):e1012899. doi: 10.1371/journal.pcbi.1012899. eCollection 2025 Mar.
Although the cerebellum is typically associated with supervised learning algorithms, it also exhibits extensive involvement in reward processing. In this study, we investigated the cerebellum's role in executing reinforcement learning algorithms, with a particular emphasis on essential reward-prediction errors. We employed a Q-learning model to accurately reproduce the licking responses of mice in a Go/No-go auditory-discrimination task. This approach enabled the calculation of reinforcement learning variables, such as the reward, the predicted reward, and the reward-prediction error, in each learning trial. Through tensor component analysis of two-photon Ca2+ imaging data from more than 6,000 Purkinje cells, we found that climbing fiber inputs of two distinct components, which were specifically activated during the Go and No-go cues over the course of learning, showed an inverse relationship with predictive reward-prediction errors. Assuming bidirectional plasticity at parallel fiber-Purkinje cell synapses, we constructed a cerebellar neural-network model comprising 5,000 spiking neurons representing granule cells, Purkinje cells, cerebellar nuclei neurons, and inferior olive neurons. The network model qualitatively reproduced the distinct changes in licking behaviors, climbing-fiber firing rates, and their synchronization during discrimination learning, separately for the Go and No-go conditions. We found that Purkinje cells in the two components could develop specific motor commands for their respective auditory cues, guided by the predictive reward-prediction errors conveyed by their climbing fiber inputs. These results indicate a possible role for context-specific actors in modular reinforcement learning, integrating with the cerebellum's supervised learning capabilities.
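As a rough illustration of the behavioral-fitting step summarized above, the sketch below implements a minimal Q-learning agent for a Go/No-go licking task and records the reward-prediction error on each trial. This is not the authors' fitted model: the learning rate, inverse temperature, reward scheme, and the stochastic lick policy are all assumptions chosen for illustration.

```python
import numpy as np

# Minimal Q-learning sketch for a Go/No-go licking task (illustration only).
# Assumptions: licking to the Go cue yields reward 1, licking to the No-go cue
# yields a penalty of -0.5, withholding yields 0; alpha and beta are arbitrary.

rng = np.random.default_rng(0)

alpha = 0.1     # learning rate (assumed)
beta = 3.0      # inverse temperature of the lick policy (assumed)
n_trials = 500

# Q[cue, action]: cue 0 = Go, cue 1 = No-go; action 0 = no lick, action 1 = lick
Q = np.zeros((2, 2))
rpe_trace = []  # reward-prediction error on each trial

for t in range(n_trials):
    cue = rng.integers(2)  # randomly interleaved Go/No-go cues
    # softmax (logistic) lick probability from the Q-value difference
    p_lick = 1.0 / (1.0 + np.exp(-beta * (Q[cue, 1] - Q[cue, 0])))
    action = int(rng.random() < p_lick)

    # assumed reward scheme
    if cue == 0:
        reward = 1.0 if action == 1 else 0.0
    else:
        reward = -0.5 if action == 1 else 0.0

    rpe = reward - Q[cue, action]   # reward-prediction error
    Q[cue, action] += alpha * rpe   # Q-learning update
    rpe_trace.append(rpe)

print("final Q-values:\n", Q)
```

In the study itself, the trial-by-trial reward-prediction errors obtained from the fitted model were related to climbing-fiber component activations extracted by tensor component analysis; the sketch only shows how such reinforcement learning variables can be generated on a per-trial basis.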