Suppr 超能文献


Relative Entropy of Correct Proximal Policy Optimization Algorithms with Modified Penalty Factor in Complex Environment.

Authors

Chen Weimin, Wong Kelvin Kian Loong, Long Sifan, Sun Zhili

Affiliations

School of Information and Electronics, Hunan City University, Yiyang 413000, China.

School of Computer Science and Engineering, Central South University, Changsha 410075, China.

Published in

Entropy (Basel). 2022 Mar 22;24(4):440. doi: 10.3390/e24040440.

DOI: 10.3390/e24040440
PMID: 35455103
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9031020/
Abstract

In the field of reinforcement learning, we propose a Correct Proximal Policy Optimization (CPPO) algorithm based on a modified penalty factor and relative entropy, in order to address the robustness and stationarity problems of traditional algorithms. Firstly, in the reinforcement learning process, this paper establishes a policy evaluation mechanism through the policy distribution function. Secondly, the state space function is quantified by introducing entropy, whereby an approximation policy is used to approximate the real policy distribution, and kernel-function estimation and calculation of the relative entropy are used to fit the reward function for complex problems. Finally, through comparative analysis on classic test cases, we demonstrate that the proposed algorithm is effective, converges faster, and performs better than the traditional PPO algorithm, and that the relative-entropy measure can reveal the differences. In addition, it can use the information of a complex environment more efficiently to learn policies. At the same time, this paper not only explains the rationality of the policy distribution theory; the proposed framework also balances iteration steps, computational complexity, and convergence speed, and we introduce an effective performance measure based on the relative entropy concept.
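The abstract does not reproduce the CPPO objective itself. As a rough illustration of the general idea it builds on (penalizing the PPO surrogate by the relative entropy, i.e. KL divergence, between old and new policies, as in the original PPO-penalty variant with an adaptive coefficient), here is a minimal numpy sketch. The function names, the adaptive-beta rule, and the factor 1.5 are the generic PPO-penalty scheme, not the paper's modified penalty factor:

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

def ppo_penalty_objective(ratio, advantage, kl, beta):
    """KL-penalized surrogate: mean(r_t * A_t) - beta * D_KL(pi_old || pi_new)."""
    return float(np.mean(ratio * advantage) - beta * kl)

def update_beta(beta, kl, kl_target, factor=1.5):
    """Adaptive penalty: raise beta when the KL overshoots its target,
    lower it when the KL undershoots, otherwise leave it unchanged."""
    if kl > kl_target * factor:
        return beta * 2.0
    if kl < kl_target / factor:
        return beta / 2.0
    return beta
```

A modified penalty factor, in this framing, amounts to replacing `update_beta` (or the penalty term itself) with a different schedule; the paper's contribution is a specific such modification together with using the relative entropy as a performance measure.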


Figures
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b67/9031020/8c7421e428d7/entropy-24-00440-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b67/9031020/b99c32700807/entropy-24-00440-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b67/9031020/396b20c05590/entropy-24-00440-g003a.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b67/9031020/85d6fbfc97b8/entropy-24-00440-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b67/9031020/e739f043e291/entropy-24-00440-g005.jpg

Similar articles

1
Relative Entropy of Correct Proximal Policy Optimization Algorithms with Modified Penalty Factor in Complex Environment.
Entropy (Basel). 2022 Mar 22;24(4):440. doi: 10.3390/e24040440.
2
An Improved Distributed Sampling PPO Algorithm Based on Beta Policy for Continuous Global Path Planning Scheme.
Sensors (Basel). 2023 Jul 2;23(13):6101. doi: 10.3390/s23136101.
3
Diversity Evolutionary Policy Deep Reinforcement Learning.
Comput Intell Neurosci. 2021 Aug 3;2021:5300189. doi: 10.1155/2021/5300189. eCollection 2021.
4
Efficient Detection of Malicious Traffic Using a Decision Tree-Based Proximal Policy Optimisation Algorithm: A Deep Reinforcement Learning Malicious Traffic Detection Model Incorporating Entropy.
Entropy (Basel). 2024 Jul 30;26(8):648. doi: 10.3390/e26080648.
5
Authentic Boundary Proximal Policy Optimization.
IEEE Trans Cybern. 2022 Sep;52(9):9428-9438. doi: 10.1109/TCYB.2021.3051456. Epub 2022 Aug 18.
6
Kernel-based least squares policy iteration for reinforcement learning.
IEEE Trans Neural Netw. 2007 Jul;18(4):973-92. doi: 10.1109/TNN.2007.899161.
7
Deep Reinforcement Learning Microgrid Optimization Strategy Considering Priority Flexible Demand Side.
Sensors (Basel). 2022 Mar 14;22(6):2256. doi: 10.3390/s22062256.
8
Actor-Critic Learning Control With Regularization and Feature Selection in Policy Gradient Estimation.
IEEE Trans Neural Netw Learn Syst. 2021 Mar;32(3):1217-1227. doi: 10.1109/TNNLS.2020.2981377. Epub 2021 Mar 1.
9
Kernel-Based Least Squares Temporal Difference With Gradient Correction.
IEEE Trans Neural Netw Learn Syst. 2016 Apr;27(4):771-82. doi: 10.1109/TNNLS.2015.2424233. Epub 2015 May 1.
10
Model-Based Predictive Control and Reinforcement Learning for Planning Vehicle-Parking Trajectories for Vertical Parking Spaces.
Sensors (Basel). 2023 Aug 11;23(16):7124. doi: 10.3390/s23167124.

References cited in this article

1
A Novel Hybrid Approach for Partial Discharge Signal Detection Based on Complete Ensemble Empirical Mode Decomposition with Adaptive Noise and Approximate Entropy.
Entropy (Basel). 2020 Sep 17;22(9):1039. doi: 10.3390/e22091039.
2
Mastering the game of Go without human knowledge.
Nature. 2017 Oct 18;550(7676):354-359. doi: 10.1038/nature24270.
3
A Robust Regression Framework with Laplace Kernel-Induced Loss.
Neural Comput. 2017 Nov;29(11):3014-3039. doi: 10.1162/neco_a_01002. Epub 2017 Aug 4.
4
Human-level control through deep reinforcement learning.
Nature. 2015 Feb 26;518(7540):529-33. doi: 10.1038/nature14236.