

Entropic Regularization of Markov Decision Processes.

Authors

Boris Belousov, Jan Peters

Affiliations

Department of Computer Science, Technische Universität Darmstadt, 64289 Darmstadt, Germany.

Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany.

Publication

Entropy (Basel). 2019 Jul 10;21(7):674. doi: 10.3390/e21070674.

DOI: 10.3390/e21070674
PMID: 33267388
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC7515171/
Abstract

An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior works proposed bounding the information loss, measured by the Kullback-Leibler (KL) divergence, at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider a broader family of f-divergences, and more concretely α-divergences, which inherit the beneficial property of providing the policy improvement step in closed form while at the same time yielding a corresponding dual objective for policy evaluation. This entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, common least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement is shown to correspond to the Pearson χ²-divergence penalty. Other actor-critic pairs arise for various choices of the penalty-generating function f. On a concrete instantiation of our framework with the α-divergence, we carry out an asymptotic analysis of the solutions for different values of α and demonstrate the effects of the divergence function choice on common standard reinforcement learning problems.
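The penalized update behind this "entropic proximal" view can be written compactly. The following is a minimal sketch in illustrative notation (η for the penalty strength, A^{π_k} for the advantage under the current policy π_k; the penalty direction D_f(π ‖ π_k) used here is our assumption), not a formula quoted from the paper:

\[
\pi_{k+1} \;=\; \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi}\big[A^{\pi_k}(s,a)\big] \;-\; \frac{1}{\eta}\, D_f(\pi \,\|\, \pi_k),
\qquad
D_f(\pi \,\|\, \pi_k) \;=\; \mathbb{E}_{a \sim \pi_k}\!\left[f\!\left(\frac{\pi(a \mid s)}{\pi_k(a \mid s)}\right)\right].
\]

For the KL generator f(x) = x log x the maximizer is available in closed form, π_{k+1}(a|s) ∝ π_k(a|s) exp(η A^{π_k}(s,a)), the familiar exponential advantage weighting; for the Pearson χ² generator f(x) = (x − 1)² the optimal density ratio comes out affine in the centered advantage, which is the closed-form improvement step that pairs naturally with a least-squares critic.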

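To make the actor-critic correspondence concrete, here is a small, self-contained tabular sketch of the two closed-form updates above. It is an illustration under the stated assumptions (penalty direction, temperature eta), not the authors' implementation, and every name in it is ours:

import numpy as np

def kl_update(pi_old, adv, eta):
    # KL penalty: exponential advantage weighting,
    # pi_new(a|s) proportional to pi_old(a|s) * exp(eta * A(s, a)).
    logits = eta * adv
    logits = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    pi_new = pi_old * np.exp(logits)
    return pi_new / pi_new.sum(axis=1, keepdims=True)

def chi2_update(pi_old, adv, eta):
    # Pearson chi-squared penalty: the optimal density ratio is affine in the
    # centered advantage, r = 1 + (eta / 2) * (A - E_{pi_old}[A | s]).
    # Clipping at zero and renormalizing is a crude stand-in for the support
    # constraint that an exact dual treatment would enforce.
    baseline = (pi_old * adv).sum(axis=1, keepdims=True)  # per-state value baseline
    ratio = np.clip(1.0 + 0.5 * eta * (adv - baseline), 0.0, None)
    pi_new = pi_old * ratio
    return pi_new / pi_new.sum(axis=1, keepdims=True)

# Toy usage: 2 states, 3 actions, advantages as a critic might estimate them.
pi = np.full((2, 3), 1.0 / 3.0)
adv = np.array([[1.0, 0.0, -1.0],
                [0.2, 0.1, -0.3]])
print(kl_update(pi, adv, eta=1.0))
print(chi2_update(pi, adv, eta=1.0))

Read this way, the chi-squared ratio is exactly the weight that appears in advantage-weighted maximum-likelihood fitting of the actor, while the per-state baseline is what a least-squares critic estimates, which is one way to see the pairing the abstract describes.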

Figures (PMC full text):
Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c86/7515171/cee08f35ff55/entropy-21-00674-g001.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c86/7515171/cb161af99fd2/entropy-21-00674-g002.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c86/7515171/78b4391168fd/entropy-21-00674-g003.jpg
Figure A1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c86/7515171/bc09d29fea60/entropy-21-00674-g0A1.jpg

Similar Articles

1. Entropic Regularization of Markov Decision Processes.
Entropy (Basel). 2019 Jul 10;21(7):674. doi: 10.3390/e21070674.
2. A Maximum Divergence Approach to Optimal Policy in Deep Reinforcement Learning.
IEEE Trans Cybern. 2023 Mar;53(3):1499-1510. doi: 10.1109/TCYB.2021.3104612. Epub 2023 Feb 15.
3. Actor-Critic Learning Control With Regularization and Feature Selection in Policy Gradient Estimation.
IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):1217-1227. doi: 10.1109/TNNLS.2020.2981377. Epub 2021 Mar 1.
4. Forward and inverse reinforcement learning sharing network weights and hyperparameters.
Neural Netw. 2021 Dec;144:138-153. doi: 10.1016/j.neunet.2021.08.017. Epub 2021 Aug 20.
5. Optimistic reinforcement learning by forward Kullback-Leibler divergence optimization.
Neural Netw. 2022 Aug;152:169-180. doi: 10.1016/j.neunet.2022.04.021. Epub 2022 Apr 21.
6. Actor-Critic Learning Control Based on ℓ2-Regularized Temporal-Difference Prediction With Gradient Correction.
IEEE Trans Neural Netw Learn Syst. 2018 Dec;29(12):5899-5909. doi: 10.1109/TNNLS.2018.2808203. Epub 2018 Apr 5.
7. Boosting On-Policy Actor-Critic With Shallow Updates in Critic.
IEEE Trans Neural Netw Learn Syst. 2025 Mar;36(3):5644-5653. doi: 10.1109/TNNLS.2024.3378913. Epub 2025 Feb 28.
8. Asynchronous learning for actor-critic neural networks and synchronous triggering for multiplayer system.
ISA Trans. 2022 Oct;129(Pt B):295-308. doi: 10.1016/j.isatra.2022.02.007. Epub 2022 Feb 10.
9. Learn Quasi-Stationary Distributions of Finite State Markov Chain.
Entropy (Basel). 2022 Jan 17;24(1):133. doi: 10.3390/e24010133.
10. Meta attention for Off-Policy Actor-Critic.
Neural Netw. 2023 Jun;163:86-96. doi: 10.1016/j.neunet.2023.03.024. Epub 2023 Mar 28.

Cited By

1. Information-Theoretic Cost-Benefit Analysis of Hybrid Decision Workflows in Finance.
Entropy (Basel). 2025 Jul 23;27(8):780. doi: 10.3390/e27080780.
2. Co-Evolution of Predator-Prey Ecosystems by Reinforcement Learning Agents.
Entropy (Basel). 2021 Apr 13;23(4):461. doi: 10.3390/e23040461.
3. Co-Training for Visual Object Recognition Based on Self-Supervised Models Using a Cross-Entropy Regularization.

References

1. An Elementary Introduction to Information Geometry.
Entropy (Basel). 2020 Sep 29;22(10):1100. doi: 10.3390/e22101100.
2. An information-theoretic approach to curiosity-driven reinforcement learning.
Theory Biosci. 2012 Sep;131(3):139-48. doi: 10.1007/s12064-011-0142-z. Epub 2012 Jul 12.
3. Autonomy: an information theoretic perspective.
Biosystems. 2008 Feb;91(2):331-45. doi: 10.1016/j.biosystems.2007.05.018. Epub 2007 Aug 11.