Learning to maximize reward rate: a model based on semi-Markov decision processes.

Affiliation

Department of Psychological and Brain Sciences, Indiana University Bloomington, IN, USA.

Publication information

Front Neurosci. 2014 May 23;8:101. doi: 10.3389/fnins.2014.00101. eCollection 2014.

DOI:10.3389/fnins.2014.00101

PMID:24904252

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4033239/

Abstract

When animals have to make a number of decisions during a limited time interval, they face a fundamental problem: how much time they should spend on each decision in order to achieve the maximum possible total outcome. Deliberating longer on one decision usually yields a larger outcome, but leaves less time for the other decisions. In the framework of sequential sampling models, the question is how animals learn to set their decision threshold such that the total expected outcome achieved during a limited time is maximized. The aim of this paper is to provide a theoretical framework for answering this question. To this end, we consider an experimental design in which each trial can come from one of several possible "conditions." A condition specifies the difficulty of the trial, the reward, the penalty and so on. We show that to maximize the expected reward during a limited time, the subject should set a separate value of the decision threshold for each condition. We propose a model of learning the optimal values of the decision thresholds based on the theory of semi-Markov decision processes (SMDP). In our model, the experimental environment is modeled as an SMDP with each "condition" being a "state" and the values of the decision thresholds being the "actions" taken in those states. The problem of finding the optimal decision thresholds is then cast as the stochastic optimal control problem of taking actions in each state of the corresponding SMDP such that the average reward rate is maximized. Our model utilizes a biologically plausible learning algorithm to solve this problem. The simulation results show that at the beginning of learning the model chooses high values of the decision threshold, which lead to sub-optimal performance. With experience, however, the model learns to lower the decision thresholds until it finally finds the optimal values.
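The claim that each condition calls for its own threshold can be illustrated with a small numerical sketch. This is not the learning algorithm proposed in the paper: it assumes an unbiased drift-diffusion process with symmetric bounds and unit noise, uses the standard closed-form expressions for its error rate and mean decision time, and finds thresholds by brute-force grid search rather than by learning. The two conditions, their drift rates, rewards, penalties, and the inter-trial interval are hypothetical values chosen only for illustration.

```python
import math
from itertools import product


def ddm_error_rate(drift, threshold, noise=1.0):
    """Error probability of an unbiased diffusion with bounds at +/- threshold."""
    return 1.0 / (1.0 + math.exp(2.0 * drift * threshold / noise**2))


def ddm_mean_decision_time(drift, threshold, noise=1.0):
    """Mean first-passage time of the same diffusion process."""
    return (threshold / drift) * math.tanh(drift * threshold / noise**2)


def reward_rate(conditions, thresholds, inter_trial_interval=1.0):
    """Average reward per unit time when conditions are drawn independently.

    Expected reward per trial divided by expected trial duration
    (decision time plus a fixed inter-trial interval).
    """
    expected_reward = 0.0
    expected_time = 0.0
    for cond, a in zip(conditions, thresholds):
        er = ddm_error_rate(cond["drift"], a)
        dt = ddm_mean_decision_time(cond["drift"], a)
        payoff = (1.0 - er) * cond["reward"] - er * cond["penalty"]
        expected_reward += cond["prob"] * payoff
        expected_time += cond["prob"] * (dt + inter_trial_interval)
    return expected_reward / expected_time


# Hypothetical conditions (assumed values, not taken from the paper).
CONDITIONS = [
    {"prob": 0.5, "drift": 2.0, "reward": 1.0, "penalty": 0.0},  # easy trials
    {"prob": 0.5, "drift": 0.5, "reward": 1.0, "penalty": 0.0},  # hard trials
]

# Candidate thresholds: 0.05, 0.10, ..., 3.00.
GRID = [0.05 * k for k in range(1, 61)]


def best_thresholds(conditions, grid, shared=False):
    """Grid search for thresholds that maximize the reward rate.

    With shared=True, every condition is forced to use the same threshold.
    """
    if shared:
        candidates = [(a,) * len(conditions) for a in grid]
    else:
        candidates = product(grid, repeat=len(conditions))
    return max(candidates, key=lambda th: reward_rate(conditions, th))


separate = best_thresholds(CONDITIONS, GRID)
shared = best_thresholds(CONDITIONS, GRID, shared=True)

print("separate thresholds:", separate, "rate:", reward_rate(CONDITIONS, separate))
print("shared threshold:   ", shared, "rate:", reward_rate(CONDITIONS, shared))
```

Because the shared-threshold candidates are a subset of the per-condition candidates, the per-condition search can never do worse on this grid. Under these assumed parameters it settles on a lower threshold for the hard condition than for the easy one, which is one way of seeing why a single threshold is generally not optimal.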

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fb21/4033239/8882fe5ac7b7/fnins-08-00101-g0001.jpg
