
Guided Policy Exploration for Markov Decision Processes Using an Uncertainty-Based Value-of-Information Criterion.

Publication Information

IEEE Trans Neural Netw Learn Syst. 2018 Jun;29(6):2080-2098. doi: 10.1109/TNNLS.2018.2812709.

DOI: 10.1109/TNNLS.2018.2812709
PMID: 29771664
Abstract

Reinforcement learning in environments with many action-state pairs is challenging. The issue is the number of episodes needed to thoroughly search the policy space. Most conventional heuristics address this search problem in a stochastic manner. This can leave large portions of the policy space unvisited during the early training stages. In this paper, we propose an uncertainty-based, information-theoretic approach for performing guided stochastic searches that more effectively cover the policy space. Our approach is based on the value of information, a criterion that provides the optimal tradeoff between expected costs and the granularity of the search process. The value of information yields a stochastic routine for choosing actions during learning that can explore the policy space in a coarse to fine manner. We augment this criterion with a state-transition uncertainty factor, which guides the search process into previously unexplored regions of the policy space. We evaluate the uncertainty-based value-of-information policies on the games Centipede and Crossy Road. Our results indicate that our approach yields better performing policies in fewer episodes than stochastic-based exploration strategies. We show that the training rate for our approach can be further improved by using the policy cross entropy to guide our criterion's hyperparameter selection.
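The exploration scheme the abstract describes can be pictured as a Boltzmann-style action distribution whose temperature is annealed from coarse to fine, augmented with a bonus for rarely visited actions that pushes the search into unexplored regions. The sketch below is an illustrative simplification under assumed choices (a count-based `beta / sqrt(n)` bonus, a linear temperature schedule), not the authors' exact value-of-information criterion:

```python
import numpy as np

rng = np.random.default_rng(0)

def voi_action_probs(q_values, tau):
    """Boltzmann-style distribution over actions.

    A high temperature tau spreads probability broadly (coarse search);
    a low tau concentrates it on high-value actions (fine search).
    """
    z = (q_values - q_values.max()) / tau   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def select_action(q_values, visit_counts, tau, beta):
    """Pick an action from value estimates plus an uncertainty bonus.

    visit_counts[a] is how often action a was tried in this state;
    rarely tried actions get a larger bonus, steering the search toward
    previously unexplored regions. beta and the bonus form are
    illustrative assumptions, not the paper's exact factor.
    """
    bonus = beta / np.sqrt(visit_counts + 1.0)
    p = voi_action_probs(q_values + bonus, tau)
    return int(rng.choice(len(q_values), p=p))

# Coarse-to-fine exploration: anneal the temperature across episodes.
q = np.array([0.1, 0.5, 0.3])
counts = np.array([10.0, 2.0, 0.0])
for tau in np.linspace(2.0, 0.1, 5):
    a = select_action(q, counts, tau=tau, beta=0.5)
    counts[a] += 1
```

Early on (large `tau`, large bonus for action 2) the policy samples almost uniformly; as `tau` shrinks, mass concentrates on the highest estimated value, which is the coarse-to-fine behavior the abstract attributes to the value-of-information criterion.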


Similar Articles

1
Guided Policy Exploration for Markov Decision Processes Using an Uncertainty-Based Value-of-Information Criterion.
IEEE Trans Neural Netw Learn Syst. 2018 Jun;29(6):2080-2098. doi: 10.1109/TNNLS.2018.2812709.
2
An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits.
Entropy (Basel). 2018 Feb 28;20(3):155. doi: 10.3390/e20030155.
3
A Maximum Divergence Approach to Optimal Policy in Deep Reinforcement Learning.
IEEE Trans Cybern. 2023 Mar;53(3):1499-1510. doi: 10.1109/TCYB.2021.3104612. Epub 2023 Feb 15.
4
A map of ecologically rational heuristics for uncertain strategic worlds.
Psychol Rev. 2020 Mar;127(2):245-280. doi: 10.1037/rev0000171. Epub 2019 Nov 21.
5
Scalable approximate policies for Markov decision process models of hospital elective admissions.
Artif Intell Med. 2014 May;61(1):21-34. doi: 10.1016/j.artmed.2014.04.001. Epub 2014 Apr 13.
6
Adaptive Optimal Control for Stochastic Multiplayer Differential Games Using On-Policy and Off-Policy Reinforcement Learning.
IEEE Trans Neural Netw Learn Syst. 2020 Dec;31(12):5522-5533. doi: 10.1109/TNNLS.2020.2969215. Epub 2020 Nov 30.
7
Policy Search for the Optimal Control of Markov Decision Processes: A Novel Particle-Based Iterative Scheme.
IEEE Trans Cybern. 2016 Nov;46(11):2643-2655. doi: 10.1109/TCYB.2015.2483780. Epub 2015 Oct 26.
8
Reduction of Markov Chains Using a Value-of-Information-Based Approach.
Entropy (Basel). 2019 Mar 30;21(4):349. doi: 10.3390/e21040349.
9
Parameterized MDPs and Reinforcement Learning Problems-A Maximum Entropy Principle-Based Framework.
IEEE Trans Cybern. 2022 Sep;52(9):9339-9351. doi: 10.1109/TCYB.2021.3102510. Epub 2022 Aug 18.
10
Evolving Robust Policy Coverage Sets in Multi-Objective Markov Decision Processes Through Intrinsically Motivated Self-Play.
Front Neurorobot. 2018 Oct 9;12:65. doi: 10.3389/fnbot.2018.00065. eCollection 2018.

Cited By

1
Reduction of Markov Chains Using a Value-of-Information-Based Approach.
Entropy (Basel). 2019 Mar 30;21(4):349. doi: 10.3390/e21040349.
2
An Analysis of the Value of Information When Exploring Stochastic, Discrete Multi-Armed Bandits.
Entropy (Basel). 2018 Feb 28;20(3):155. doi: 10.3390/e20030155.