The actor-critic learning is behind the matching law: matching versus optimal behaviors.

Authors

Sakai Yutaka, Fukai Tomoki

Affiliation

Department of Intelligent Information Systems, Tamagawa University, Machida, Tokyo 194-8610, Japan.

Publication

Neural Comput. 2008 Jan;20(1):227-51. doi: 10.1162/neco.2008.20.1.227.

PMID: 18045007
Abstract

The ability to make a correct choice of behavior from various options is crucial for animals' survival. The neural basis for the choice of behavior has been attracting growing attention in research on biological and artificial neural systems. Alternative choice tasks with variable ratio (VR) and variable interval (VI) schedules of reinforcement have often been employed in studying decision making by animals and humans. In the VR schedule task, alternative choices are reinforced with different probabilities, and subjects learn to select the behavioral response rewarded more frequently. In the VI schedule task, alternative choices are reinforced at different average intervals independent of the choice frequencies, and the choice behavior follows the so-called matching law. The two policies appear robustly in subjects' choice of behavior, but the underlying neural mechanisms remain unknown. Here, we show that these seemingly different policies can appear from a common computational algorithm known as actor-critic learning. We present experimentally testable variations of the VI schedule in which the matching behavior gives only a suboptimal solution to decision making and show that the actor-critic system exhibits the matching behavior in the steady state of the learning even when the matching behavior is suboptimal. However, it is found that the matching behavior can earn approximately the same reward as the optimal one in many practical situations.
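The matching law says that, in the steady state, the fraction of choices a subject allocates to an option equals the fraction of rewards it obtains from that option. To make the abstract's claim concrete, here is a minimal, hypothetical sketch (not the authors' code) of an actor-critic agent on a two-option concurrent VI ("baited") schedule: each option independently arms a reward that persists until the option is next chosen, a softmax actor selects between the options, and the critic's reward-prediction error drives both the critic and the actor updates. The function name, parameters, and learning rates are all illustrative assumptions.

```python
import math
import random

def simulate_vi_actor_critic(rates=(0.2, 0.05), steps=100_000,
                             alpha=0.005, seed=0):
    """Actor-critic agent on a two-choice concurrent VI (baited) schedule.

    Each option independently becomes baited with probability rates[i]
    per step; a bait persists until that option is chosen (the VI
    property). The critic tracks the average immediate reward; its
    prediction error also trains the actor's softmax preferences.
    """
    rng = random.Random(seed)
    baited = [False, False]      # armed reward on each option
    pref = [0.0, 0.0]            # actor preferences (softmax logits)
    value = 0.0                  # critic: running reward estimate
    choices = [0, 0]
    rewards = [0.0, 0.0]
    for _ in range(steps):
        for i, rate in enumerate(rates):             # bait the schedules
            if rng.random() < rate:
                baited[i] = True
        p0 = 1.0 / (1.0 + math.exp(pref[1] - pref[0]))  # P(choose option 0)
        a = 0 if rng.random() < p0 else 1
        r = 1.0 if baited[a] else 0.0
        baited[a] = False                            # collecting clears bait
        delta = r - value                            # reward-prediction error
        value += alpha * delta                       # critic update
        grad0 = (1.0 - p0) if a == 0 else -p0        # d log pi(a) / d pref[0]
        pref[0] += alpha * delta * grad0             # actor (policy-gradient)
        pref[1] -= alpha * delta * grad0
        choices[a] += 1
        rewards[a] += r
    return choices, rewards
```

Run long enough, the agent's choice fraction for the richer option approximately tracks its obtained-reward fraction — matching — even though, as the paper shows, matching is in general only a near-optimal policy on such schedules.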

Similar Articles

1. The actor-critic learning is behind the matching law: matching versus optimal behaviors.
   Neural Comput. 2008 Jan;20(1):227-51. doi: 10.1162/neco.2008.20.1.227.
2. Statistical mechanics of reward-modulated learning in decision-making networks.
   Neural Comput. 2012 May;24(5):1230-70. doi: 10.1162/NECO_a_00264. Epub 2012 Feb 1.
3. Operant matching as a Nash equilibrium of an intertemporal game.
   Neural Comput. 2009 Oct;21(10):2755-73. doi: 10.1162/neco.2009.09-08-854.
4. Reinforcement learning and decision making in monkeys during a competitive game.
   Brain Res Cogn Brain Res. 2004 Dec;22(1):45-58. doi: 10.1016/j.cogbrainres.2004.07.007.
5. A spiking neural network model of an actor-critic learning agent.
   Neural Comput. 2009 Feb;21(2):301-39. doi: 10.1162/neco.2008.08-07-593.
6. Model-based reinforcement learning under concurrent schedules of reinforcement in rodents.
   Learn Mem. 2009 Apr 29;16(5):315-23. doi: 10.1101/lm.1295509. Print 2009 May.
7. Integration of reinforcement learning and optimal decision-making theories of the basal ganglia.
   Neural Comput. 2011 Apr;23(4):817-51. doi: 10.1162/NECO_a_00103. Epub 2011 Jan 11.
8. Reward-dependent learning in neuronal networks for planning and decision making.
   Prog Brain Res. 2000;126:217-29. doi: 10.1016/S0079-6123(00)26016-0.
9. [Mathematical models of decision making and learning].
   Brain Nerve. 2008 Jul;60(7):791-8.
10. A model of hippocampally dependent navigation, using the temporal difference learning rule.
   Hippocampus. 2000;10(1):1-16. doi: 10.1002/(SICI)1098-1063(2000)10:1<1::AID-HIPO1>3.0.CO;2-1.

Cited By

1. Stimulus uncertainty and relative reward rates determine adaptive responding in perceptual decision-making.
   PLoS Comput Biol. 2025 May 27;21(5):e1012636. doi: 10.1371/journal.pcbi.1012636. eCollection 2025 May.
2. Undermatching Is a Consequence of Policy Compression.
   J Neurosci. 2023 Jan 18;43(3):447-457. doi: 10.1523/JNEUROSCI.1003-22.2022. Epub 2022 Dec 6.
3. Value-free reinforcement learning: policy optimization as a minimal model of operant behavior.
   Curr Opin Behav Sci. 2021 Oct;41:114-121. doi: 10.1016/j.cobeha.2021.04.020. Epub 2021 May 28.
4. Choice history effects in mice and humans improve reward harvesting efficiency.
   PLoS Comput Biol. 2021 Oct 4;17(10):e1009452. doi: 10.1371/journal.pcbi.1009452. eCollection 2021 Oct.
5. Dynamic decision making and value computations in medial frontal cortex.
   Int Rev Neurobiol. 2021;158:83-113. doi: 10.1016/bs.irn.2020.12.001. Epub 2021 Jan 23.
6. Simulating bout-and-pause patterns with reinforcement learning.
   PLoS One. 2020 Nov 12;15(11):e0242201. doi: 10.1371/journal.pone.0242201. eCollection 2020.
7. Deviation from the matching law reflects an optimal strategy involving learning over multiple timescales.
   Nat Commun. 2019 Apr 1;10(1):1466. doi: 10.1038/s41467-019-09388-3.
8. Optimal response vigor and choice under non-stationary outcome values.
   Psychon Bull Rev. 2019 Feb;26(1):182-204. doi: 10.3758/s13423-018-1500-3.
9. An effect of serotonergic stimulation on learning rates for rewards apparent after long intertrial intervals.
   Nat Commun. 2018 Jun 26;9(1):2477. doi: 10.1038/s41467-018-04840-2.
10. Adaptive learning and decision-making under uncertainty by metaplastic synapses guided by a surprise detection system.
   Elife. 2016 Aug 9;5:e18073. doi: 10.7554/eLife.18073.