Hua-Dong Xiong, Ji-An Li, Robert C. Wilson, Marcelo G. Mattar
School of Psychology, Georgia Institute of Technology.
Neurosciences Graduate Program, University of California San Diego.
bioRxiv. 2025 Jul 31:2025.07.28.667308. doi: 10.1101/2025.07.28.667308.
A hallmark of intelligence is the ability to adapt behavior to changing environments, which requires adapting one's own learning strategies. This phenomenon is known as learning to learn in cognitive science and meta-learning in artificial intelligence. While this phenomenon is well-established in humans and animals, no quantitative framework exists for characterizing the trajectories through which biological agents adapt their learning strategies. Previous computational studies either assume fixed strategies or use task-optimized neural networks, and thus do not explain how humans refine their strategies through experience. Here we show that humans adjust their reinforcement learning strategies in a manner resembling gradient-based online optimization. We introduce DynamicRL, a framework that uses neural networks to track how participants' learning parameters (e.g., learning rates and decision temperatures) evolve throughout experiments. Across four diverse bandit tasks, DynamicRL consistently outperforms traditional reinforcement learning models with fixed parameters, demonstrating that humans continuously adapt their strategies over time. These dynamically estimated parameters reveal trajectories that systematically increase expected rewards, with updates significantly aligned with policy gradient ascent directions. Furthermore, this learning process operates across multiple timescales: strategy parameters update more slowly than behavioral choices, and update effectiveness correlates with local gradient strength in the reward landscape. Our work offers a generalizable approach for characterizing meta-learning trajectories, bridging theories of biological and artificial intelligence by providing a quantitative method for studying how adaptive behavior is optimized through experience.
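A minimal sketch, not the authors' implementation, illustrating the two ideas the abstract describes: (1) a bandit learner whose strategy parameters (learning rate alpha, inverse temperature beta) can take different values over time, and (2) testing whether a parameter update points in the direction of the policy gradient of expected reward. The reward probabilities, the example parameter trajectory, and all function names (expected_reward, reward_gradient) are hypothetical assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
P_REWARD = np.array([0.8, 0.3])  # assumed reward probabilities of a two-armed bandit


def softmax(q, beta):
    """Softmax choice rule with inverse temperature beta (max-shifted for stability)."""
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()


def expected_reward(alpha, beta, n_trials=200, n_sims=200):
    """Monte Carlo estimate of mean reward for a fixed (alpha, beta) strategy."""
    total = 0.0
    for _ in range(n_sims):
        q = np.zeros(2)
        for _ in range(n_trials):
            p = softmax(q, beta)
            a = rng.choice(2, p=p)
            r = float(rng.random() < P_REWARD[a])
            q[a] += alpha * (r - q[a])  # delta-rule value update
            total += r
    return total / (n_sims * n_trials)


def reward_gradient(alpha, beta, eps=0.05):
    """Finite-difference gradient of expected reward in strategy-parameter space."""
    g_a = (expected_reward(alpha + eps, beta) - expected_reward(alpha - eps, beta)) / (2 * eps)
    g_b = (expected_reward(alpha, beta + eps) - expected_reward(alpha, beta - eps)) / (2 * eps)
    return np.array([g_a, g_b])


# Suppose a DynamicRL-style fit produced per-block parameter estimates
# (fabricated here for illustration). For each consecutive pair, ask whether
# the observed update aligns with the gradient ascent direction.
trajectory = np.array([[0.10, 1.0], [0.18, 1.6], [0.25, 2.4]])  # rows: (alpha, beta)
for (a0, b0), (a1, b1) in zip(trajectory[:-1], trajectory[1:]):
    update = np.array([a1 - a0, b1 - b0])
    grad = reward_gradient(a0, b0)
    cos = update @ grad / (np.linalg.norm(update) * np.linalg.norm(grad) + 1e-12)
    print(f"alpha={a0:.2f}, beta={b0:.2f}: cosine(update, gradient) = {cos:+.2f}")
```

A positive cosine indicates the update moved "uphill" in the reward landscape, which is the kind of alignment with policy gradient ascent that the abstract reports; the magnitude of the finite-difference gradient plays the role of the "local gradient strength" mentioned there.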