Hua-Dong Xiong, Ji-An Li, Robert C. Wilson, Marcelo G. Mattar
School of Psychology, Georgia Institute of Technology.
Neurosciences Graduate Program, University of California San Diego.
bioRxiv. 2025 Jul 31:2025.07.28.667308. doi: 10.1101/2025.07.28.667308.
A hallmark of intelligence is the ability to adapt behavior to changing environments, which requires adapting one's own learning strategies. This phenomenon is known as learning to learn in cognitive science and meta-learning in artificial intelligence. While this phenomenon is well-established in humans and animals, no quantitative framework exists for characterizing the trajectories through which biological agents adapt their learning strategies. Previous computational studies either assume fixed strategies or use task-optimized neural networks, and thus do not explain how humans refine their strategies through experience. Here we show that humans adjust their reinforcement learning strategies in a manner resembling gradient-based online optimization. We introduce DynamicRL, a framework that uses neural networks to track how participants' learning parameters (e.g., learning rates and decision temperatures) evolve throughout experiments. Across four diverse bandit tasks, DynamicRL consistently outperforms traditional reinforcement learning models with fixed parameters, demonstrating that humans continuously adapt their strategies over time. These dynamically estimated parameters reveal trajectories that systematically increase expected rewards, with updates significantly aligned with policy gradient ascent directions. Furthermore, this learning process operates across multiple timescales: strategy parameters update more slowly than behavioral choices, and update effectiveness correlates with local gradient strength in the reward landscape. Our work offers a generalizable approach for characterizing meta-learning trajectories, bridging theories of biological and artificial intelligence by providing a quantitative method for studying how adaptive behavior is optimized through experience.
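A minimal sketch, not the authors' implementation, illustrating the two ideas the abstract describes: (1) a bandit learner whose strategy parameters (learning rate alpha, inverse temperature beta) can take different values over time, and (2) testing whether a parameter update points in the direction of the policy gradient of expected reward. The reward probabilities, the example parameter trajectory, and all function names (expected_reward, reward_gradient) are hypothetical assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
P_REWARD = np.array([0.8, 0.3])  # assumed reward probabilities of a two-armed bandit


def softmax(q, beta):
    """Softmax choice rule with inverse temperature beta (max-shifted for stability)."""
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()


def expected_reward(alpha, beta, n_trials=200, n_sims=200):
    """Monte Carlo estimate of mean reward for a fixed (alpha, beta) strategy."""
    total = 0.0
    for _ in range(n_sims):
        q = np.zeros(2)
        for _ in range(n_trials):
            p = softmax(q, beta)
            a = rng.choice(2, p=p)
            r = float(rng.random() < P_REWARD[a])
            q[a] += alpha * (r - q[a])  # delta-rule value update
            total += r
    return total / (n_sims * n_trials)


def reward_gradient(alpha, beta, eps=0.05):
    """Finite-difference gradient of expected reward in strategy-parameter space."""
    g_a = (expected_reward(alpha + eps, beta) - expected_reward(alpha - eps, beta)) / (2 * eps)
    g_b = (expected_reward(alpha, beta + eps) - expected_reward(alpha, beta - eps)) / (2 * eps)
    return np.array([g_a, g_b])


# Suppose a DynamicRL-style fit produced per-block parameter estimates
# (fabricated here for illustration). For each consecutive pair, ask whether
# the observed update aligns with the gradient ascent direction.
trajectory = np.array([[0.10, 1.0], [0.18, 1.6], [0.25, 2.4]])  # rows: (alpha, beta)
for (a0, b0), (a1, b1) in zip(trajectory[:-1], trajectory[1:]):
    update = np.array([a1 - a0, b1 - b0])
    grad = reward_gradient(a0, b0)
    cos = update @ grad / (np.linalg.norm(update) * np.linalg.norm(grad) + 1e-12)
    print(f"alpha={a0:.2f}, beta={b0:.2f}: cosine(update, gradient) = {cos:+.2f}")
```

A positive cosine indicates the update moved "uphill" in the reward landscape, which is the kind of alignment with policy gradient ascent that the abstract reports; the magnitude of the finite-difference gradient plays the role of the "local gradient strength" mentioned there.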