

Relative Entropy of Correct Proximal Policy Optimization Algorithms with Modified Penalty Factor in Complex Environment.

Author Information

Chen Weimin, Wong Kelvin Kian Loong, Long Sifan, Sun Zhili

Author Affiliations

School of Information and Electronics, Hunan City University, Yiyang 413000, China.

School of Computer Science and Engineering, Central South University, Changsha 410075, China.

Publication Information

Entropy (Basel). 2022 Mar 22;24(4):440. doi: 10.3390/e24040440.

Abstract

In the field of reinforcement learning, we propose a Correct Proximal Policy Optimization (CPPO) algorithm based on a modified penalty factor and relative entropy, in order to address the robustness and stationarity problems of traditional algorithms. First, this paper establishes a policy evaluation mechanism through the policy distribution function during the reinforcement learning process. Second, the state space function is quantified by introducing entropy, whereby an approximation policy is used to approximate the real policy distribution, and kernel-function estimation and calculation of the relative entropy are used to fit the reward function for complex problems. Finally, through comparative analysis on classic test cases, we demonstrate that the proposed algorithm is effective, converges faster, and performs better than the traditional PPO algorithm, and that the relative-entropy measure can expose these differences. In addition, it can use the information of a complex environment more efficiently to learn policies. At the same time, this paper not only explains the rationality of the policy distribution theory; the proposed framework also balances iteration steps, computational complexity, and convergence speed, and we introduce an effective performance measure based on the relative-entropy concept.
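The penalty factor discussed in the abstract acts on a relative-entropy (KL divergence) term that constrains how far each policy update can move from the previous policy. The sketch below (PyTorch, assumed) illustrates the standard KL-penalized PPO surrogate with an adaptive penalty coefficient, which the paper's modified penalty factor presumably builds on; the function names, thresholds, and KL target are illustrative and not taken from the paper.

```python
# Minimal sketch of a KL-penalized PPO surrogate with an adaptive penalty
# factor (standard PPO-penalty formulation; NOT the authors' exact CPPO rule).
import torch


def ppo_kl_penalty_loss(logp_new, logp_old, advantages, beta):
    """KL-penalized PPO objective:
        L = E[ r_t(theta) * A_t ] - beta * E[ KL(pi_old || pi_new) ],
    where r_t = pi_new(a|s) / pi_old(a|s). The relative-entropy term is the
    penalty the abstract refers to; beta is the penalty factor."""
    ratio = torch.exp(logp_new - logp_old)        # importance ratio r_t
    surrogate = (ratio * advantages).mean()       # policy-gradient surrogate
    # Monte-Carlo estimate of KL(pi_old || pi_new) over actions sampled
    # from the old policy: E_old[log pi_old - log pi_new].
    approx_kl = (logp_old - logp_new).mean()
    loss = -(surrogate - beta * approx_kl)        # minimize negative objective
    return loss, approx_kl.detach()


def adapt_beta(beta, observed_kl, kl_target=0.01):
    """Classic adaptive rule: enlarge beta when the policy moved too far,
    shrink it when updates were too timid (thresholds are illustrative)."""
    if observed_kl > 1.5 * kl_target:
        beta *= 2.0
    elif observed_kl < kl_target / 1.5:
        beta /= 2.0
    return beta
```

In this formulation, beta replaces the hard clipping of standard PPO: after each update, `adapt_beta` tightens or relaxes the relative-entropy penalty depending on how far the new policy drifted, which is the knob a modified penalty factor would adjust.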


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2b67/9031020/8c7421e428d7/entropy-24-00440-g001.jpg
